Here I'll try to present how NihAV works, covering at least the most significant components. I'm going to describe the transcoding workflow since that process involves all the steps (demuxing, decoding, encoding and muxing it all again).

Demuxing

Demuxing is (obviously) done by feeding input to the demuxer and obtaining packets from it. But in order to do that we first need to open the input and select a proper demuxer.

Opening the input is done by creating a reader based on ByteIO: there are readers like nihav_core::io::byteio::FileReader for input from Rust's File structure or byteio::MemoryReader for reading data from a byte array. You can add your own reader if needed (or writer; writers are also based on ByteIO). The resulting reader is wrapped by ByteReader, which reads data in various formats (bytes, byte buffers, integers and floats of different sizes and endianness).
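For example, opening a file for reading may look like this (a quick sketch using the names from nihav_core::io::byteio as I remember them; error handling is left out):

use std::fs::File;
use std::io::BufReader;
use nihav_core::io::byteio::{ByteReader, FileReader};

let file = File::open("input.avi")?;
let mut fr = FileReader::new_read(BufReader::new(file));
let mut br = ByteReader::new(&mut fr);
// now the input can be read in whatever format is needed,
// e.g. a 32-bit big-endian word:
let magic = br.read_u32be()?;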

Now we need to select a demuxer. Usually this is done by picking a proper demuxer from a list of registered demuxers (nihav_core::demuxers::RegisteredDemuxers). This list is filled by functions usually called something_register_all_demuxers(), provided by the crates supporting various formats, or you can add some demuxers yourself. You can also use nihav_allstuff::nihav_register_all_demuxers() to obtain all demuxers provided by NihAV crates.

Demuxers are selected by name, e.g. demux_list.find_demuxer("avi") should return an object of the DemuxerCreator type that can be used to create a demuxer instance for demuxing AVI. This is done by invoking nihav_core::demuxers::create_demuxer(dmx_creator, input_reader).
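So the whole demuxer creation may look like this (a sketch; br is the ByteReader created above):

use nihav_core::demuxers::*;

let mut dmx_reg = RegisteredDemuxers::new();
nihav_allstuff::nihav_register_all_demuxers(&mut dmx_reg);
// look up the creator by name and create a demuxer over our input
let dmx_creator = dmx_reg.find_demuxer("avi").unwrap();
let mut dmx = create_demuxer(dmx_creator, &mut br)?;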

Side note: actually in this case you're using a structure wrapping the actual demuxing interface plus some auxiliary structures required by it, but nothing prevents you from using them directly.

In case you don't know which container format is proper for the input, you can use nihav_registry::detect::detect_format(). It takes the input file name and input reader and tries to determine the format by file contents, falling back to the file extension if that fails. The function also returns a DetectionScore to tell you which was the case.
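A sketch of its usage (the file name here is just an example):

use nihav_registry::detect::detect_format;

// returns the demuxer name to look up plus how it was detected
if let Some((dmx_name, _score)) = detect_format("input.avi", &mut br) {
  println!("this looks like {}", dmx_name);
}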

After the demuxer is created you can tell it to seek to a certain time, get the list of available streams, or start getting packets, all in no particular order. Demuxed packets contain a reference to the stream they belong to, so you should be able to determine which decoder to use.
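A trivial demuxing loop then may look like this (a sketch; get_frame() returns an error, normally EOF, when there is nothing more to demux):

// list the available streams first
for i in 0..dmx.get_num_streams() {
  let stream = dmx.get_stream(i).unwrap();
  println!("stream {}: {}", i, stream.get_info().get_name());
}
// then pull packets until an error (normally EOF) is returned
while let Ok(pkt) = dmx.get_frame() {
  let stream = pkt.get_stream();
  // hand pkt over to the decoder associated with stream.get_id()
}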

N.B.: there is also an interface for handling raw data streams that require parsing to be formed into NAPackets (like MPEG TS). This is done by the following chain: a RawDemuxer outputs NARawData for each of the streams, which is fed to an NAPacketiser instance to produce the NAPacket that goes to the decoder.

And this leads us to…

Decoding

Decoders are created in a similar fashion to demuxers: you create a RegisteredDecoders list, fill it e.g. with nihav_allstuff::nihav_register_all_decoders(), select a proper decoder by name and create it. The main differences are that you get a function to create the decoder instead of an object, and no input reader is required; instead you need to create NADecoderSupport which is used by decoders (more about it later). In case you wonder why the creation approaches differ, this is caused by Rust's ownership rules: (de)muxers need a reference to the input/output stream while codecs just take per-frame input and produce output. It could be done differently, but then I'd have to redo the design of the ByteIO-based readers/writers and that is annoying.
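Here is how it may look in code (a sketch again; stream is a stream obtained from the demuxer and pkt is a packet belonging to it):

use nihav_core::codecs::*;

let mut dec_reg = RegisteredDecoders::new();
nihav_allstuff::nihav_register_all_decoders(&mut dec_reg);

// look up the creation function by the codec name from the stream
let info = stream.get_info();
let decfunc = dec_reg.find_decoder(info.get_name()).unwrap();
let mut dec = (decfunc)();
let mut dsupp = Box::new(NADecoderSupport::new());
dec.init(&mut dsupp, info)?;
// and later, for every demuxed packet of this stream:
let frm = dec.decode(&mut dsupp, &pkt)?;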

The codec name is stored in the NAStream codec information along with the other information required to initialise the decoder. In NihAV codecs are always identified by short strings like "realvideo6". All known codecs should be listed in nihav_registry::register so that you can get their full names and capabilities by the identifier.

Now back to NADecoderSupport. This structure contains frame pools that may be used by video decoders instead of allocating a new output frame every time. The consumer may want to reserve some frames in that pool for its own consumption: e.g. the decoder uses three frames (one for decoding and two for references) and the consumer may add another dozen for a display queue.

Another issue is that the decoder is expected to output frames in the same order as they are fed to it. Obviously this conflicts with codecs that require frame reordering, and that's why we have the nihav_core::reorder module with the common FrameReorderer trait and several reorderers (a null one and ones for codecs with B-frames). How do you tell which one to create? As mentioned earlier, nihav_registry::register has a CodecDescription for each supported codec, including the property that the codec requires frame reordering (or use IPBReorderer for the generic case).
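Usage may look like this (a sketch; I'm assuming the NoReorderer/IPBReorderer names from nihav_core::reorder, and needs_reordering is a flag you'd derive from CodecDescription):

use nihav_core::reorder::*;

// pick a reorderer depending on what the codec requires
let mut reorderer: Box<dyn FrameReorderer> = if needs_reordering {
    Box::new(IPBReorderer::new())
  } else {
    Box::new(NoReorderer::new())
  };
// feed decoded frames in and pull them out in presentation order
reorderer.add_frame(frm);
while let Some(out_frm) = reorderer.get_frame() {
  // output or encode out_frm
}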

N.B.: recently an interface for multi-threaded decoding was introduced. While NADecoderMT was modelled after NADecoder, it works asynchronously, and thus instead of the "packet in, frame out" model it expects the caller to queue packets and retrieve frames in separate calls (with an option to poll whether it makes sense to invoke it or to wait a bit more before another attempt). Additionally such decoders require a special frame reorderer called MTFrameReorderer in order to arrange the received frames in the right order.

Frame structure

While we're speaking about decoder output, it is worth mentioning how a raw frame is represented in NihAV.

In this case the design is dictated largely by Rust's limitations and type system. Instead of a simple byte buffer with metadata describing how to interpret its contents, it is implemented as an enum whose variants contain various frame types: a video frame with packed data, a video frame with 16-bit samples, a video frame with 32-bit samples, an audio frame with packed data, an audio frame with 16-bit samples, an audio frame with 32-bit integer or floating-point samples, and some others. This means that internally samples are stored in native endianness, which simplifies things in many cases (since you cannot misinterpret the data). But in case you need to handle data as raw bytes in whatever order it was provided, NABufferType::VideoPacked(buf) and NABufferType::AudioPacked(buf) are still there for this particular purpose.
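So frame handling code usually matches on the buffer type like this (a sketch covering only some of the variants):

use nihav_core::frame::*;

match frm.get_buffer() {
  NABufferType::Video(_)       => { /* 8-bit (mostly planar) video */ },
  NABufferType::Video16(_)     => { /* video with 16-bit samples */ },
  NABufferType::VideoPacked(_) => { /* video as raw packed bytes */ },
  NABufferType::AudioI16(_)    => { /* 16-bit audio */ },
  NABufferType::AudioF32(_)    => { /* floating-point audio */ },
  NABufferType::None           => { /* no data, e.g. a skipped frame */ },
  _ => { /* the remaining variants */ },
}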

In either case the data is stored as a single buffer with offsets pointing to the start of each component, both for video and audio. Audio can be stored in either interleaved or planar form.

Another feature is that video frames are stored in whatever order they are decoded, and the metadata contains a 'flip' flag to indicate that the image should be flipped. The main reason is that in Rust you can't do arithmetic on operands of different types and you must use the special usize type for indexing, so having a signed difference between lines would lead to the annoying "convert index to signed, add stride, check if it's still in range, convert back to unsigned" dance in every place (or to implementing a special indexing trait, which is annoying in a different way). As a result the picture is stored in its native orientation and flipped only during output or format conversion.
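Accessing picture data thus boils down to plain offset-plus-stride arithmetic with unsigned indices, e.g. (a sketch for an 8-bit video buffer called vbuf, taken from NABufferType::Video):

// query the layout of the first plane
let (width, height) = vbuf.get_dimensions(0);
let offset = vbuf.get_offset(0);
let stride = vbuf.get_stride(0);
let data   = vbuf.get_data();
// walk that plane line by line, in coding (not display) order
for line in data[offset..].chunks(stride).take(height) {
  let _pixels = &line[..width];
}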

And finally there is another thing worth mentioning related to frames. In certain cases a decoder may output a frame with NABufferType::None and the frame type set to FrameType::Skip. This obviously means that the current frame is the same as the previous one and no data is transmitted for it. On one hand this complicates frame handling a bit, but on the other hand it gives a bit more information to an encoder (which may also have a special way to code repeated frames) or a player (which now does not have to show the frame).

Having said that, let's move further to…

Encoding

Creating encoders works similarly to decoders: just select a proper encoder by name from the RegisteredEncoders list filled e.g. by nihav_register_all_encoders(). But then it gets somewhat different.

First of all, encoders are pickier than decoders: while a decoder is supposed to decode an input stream as long as it's valid, encoders may have limitations on various parameters. For audio encoders these are things like sample rate and the number of channels (many audio codecs are limited to mono/stereo and 32/44.1/48 kHz only or something similar); for video codecs these are picture dimensions, pixel format and frame rate (e.g. some encoders work only on 24-bit RGB, others accept only a paletted format, yet others can handle several different pixel formats; some formats require the input picture to be one of several fixed sizes, others simply require image dimensions to be a multiple of 16 or 4; IIRC MPEG-1 Video codes the frame rate in its headers and imposes a limitation on it as well). This means that before trying to initialise an encoder you need to find out what it accepts, hence the need for format negotiation.

Format negotiation is performed by filling an EncodeParameters structure and passing it to encoder.negotiate_format(), which may reject it outright or return another instance of EncodeParameters with the input parameters adjusted so that the encoder will accept them. You can also feed it a default instance of EncodeParameters to see at least some of the parameters the encoder accepts. After the negotiation is done you can initialise the encoder and it will create the stream to which all encoded packets from this encoder belong.
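A sketch of that dance (vinfo, which is an NAVideoInfo with dimensions and pixel format, stream_no, and the encoder itself are assumed to be set up already):

use nihav_core::encoders::*;

let mut enc_params = EncodeParameters::default();
// what we would like the encoder to take as its input
enc_params.format = NACodecTypeInfo::Video(vinfo);
// the encoder adjusts whatever it cannot accept as-is
let enc_params = encoder.negotiate_format(&enc_params)?;
// initialisation creates the stream for the future packets
let stream = encoder.init(stream_no, enc_params)?;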

Another essential feature is that encoders in NihAV are expected to work asynchronously: you queue up data for them and after some indefinite amount of time they may produce some packets. Yes, this is different from decoders, because audio decoders may produce output of varying length if needed and video decoders work on a "frame in, frame out" basis unless you have to reorder frames, and we handle reordering outside the decoders, so it's not an issue there. Encoders, on the other hand, may need a look-ahead of several frames before they can encode data efficiently. One example would be the B-frame coding decision for video (or a scene change decision); another is AAC-LC, where if you have a frame with transients it's better to code it with eight short windows and the preceding frame should be coded as a "long-to-short transition" frame, and this requires at least one frame of lookahead.

Of course, if you want to tell the encoder that it should encode all the frames it has received so far (most commonly because there are no frames left to encode), you can invoke encoder.flush() to do exactly that.

This results in an encoding loop looking like this:

while let Some(frame) = frames.next() {
  encoder.encode(frame)?;
  // drain whatever packets the encoder has produced so far
  while let Some(enc_pkt) = encoder.get_packet()? {
    muxer.mux_frame(enc_pkt)?;
  }
}
// no more input: tell the encoder to encode everything it has queued
encoder.flush()?;
while let Some(enc_pkt) = encoder.get_packet()? {
  muxer.mux_frame(enc_pkt)?;
}

Muxing

Creating a muxer is not that different from creating a demuxer (there's even a function in nihav_registry to guess the target container format from the file name). The main difference is that you should provide all the output streams to the muxer on creation, and it might not accept some of them. In order to deal with that, MuxerCreator has a special function get_capabilities() that returns the stream configuration the muxer accepts: a single audio or video stream (maybe even with a fixed codec), a single audio stream plus a single video stream (again, maybe with fixed codecs), an audio- or video-only muxer, or a muxer that accepts multiple streams of various kinds.
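Checking those capabilities may look like this (a sketch; I list only some of the MuxerCapabilities variants, and mux_reg is a filled RegisteredMuxers list):

use nihav_core::muxers::*;

let mux_creator = mux_reg.find_muxer("avi").unwrap();
match mux_creator.get_capabilities() {
  MuxerCapabilities::SingleVideo(_vcodec) => { /* one video stream, maybe with a fixed codec */ },
  MuxerCapabilities::SingleVideoAndAudio(_vcodec, _acodec) => { /* one of each */ },
  MuxerCapabilities::OnlyAudio => { /* audio streams only */ },
  MuxerCapabilities::Universal => { /* anything goes */ },
  _ => { /* the other combinations */ },
}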

After the muxer is created you're supposed to send packets to it with muxer.mux_frame(), call flush() if needed (useful in cases when the muxer groups packets into larger blocks instead of writing them in the order they come, and you want to ensure they're written before a new block starts), and end() to write the file trailer, patch object sizes and such (since this operation may fail, it's better to make it explicit instead of putting it into the muxer destructor where it would fail silently without informing you).

Miscellaneous bits

I guess it's worth describing how the codec test system works. Since Rust and Cargo allow per-module tests, I've developed a set of test functions for decoders and even encoders. In order to test decoder functionality you register the crate-specific decoders and a demuxer for the test files and invoke test_decoding() with the demuxer and codec names, the test file name, plus some other parameters like a decoding limit and the expected test result. Test results can be one of three types: decoding finishes without errors, the MD5 hashes calculated per frame correspond to the list, or the MD5 of the whole decoded data matches the reference. Since frame samples are native-endian and I calculate the hash from samples and not raw bytes, the results should be the same on big-endian platforms as well, and I don't have to create output files. In addition to that I have functions to dump decoded output into a WAV file or an image sequence, so it's easier to debug a decoder (in that case you just need to recompile a single crate and re-run the test instead of compiling a whole application and running it with certain parameters). And of course there's a mode to generate MD5 hashes for decoded output that can be used as the reference later.
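A typical decoder test then may look like this (a sketch; the crate-specific register functions, the codec name and the asset path are placeholders):

#[cfg(test)]
mod test {
  use nihav_core::codecs::RegisteredDecoders;
  use nihav_core::demuxers::RegisteredDemuxers;
  use nihav_codec_support::test::dec_video::test_decoding;
  use nihav_codec_support::test::ExpectedTestResult;
  use crate::*;

  #[test]
  fn test_mycodec() {
    let mut dmx_reg = RegisteredDemuxers::new();
    mycrate_register_all_demuxers(&mut dmx_reg);
    let mut dec_reg = RegisteredDecoders::new();
    mycrate_register_all_decoders(&mut dec_reg);
    // decode up to 16 frames and merely check that nothing errors out
    test_decoding("avi", "mycodec", "assets/test.avi", Some(16),
                  &dmx_reg, &dec_reg, ExpectedTestResult::Decodes);
  }
}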

A similar system exists for muxers and encoders as well, with output both to a file and to an MD5 hash (in the latter case the muxer writes to memory which is then used to calculate the hash). Demuxers do not need it because testing a demuxer is as easy as opening a file and demuxing packets in a loop.

Another thing used to fine-tune the work of codecs and (de)muxers is object-specific options. It is a simple system built on the NAOptionHandler trait, which can report the list of supported options, set option values, or return the value currently set for some option. As a test I've added an option to one of the RealVideo decoders (frame skip mode) and to the Cinepak encoder (to set the number of strips per frame, the distance between keyframes and the quantisation method) and it worked there as expected.
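Setting and querying an option may look like this (a sketch; the option name and value here are purely illustrative):

use nihav_core::options::*;

// a hypothetical option asking an encoder for two strips per frame
encoder.set_options(&[NAOption {
  name:  "nstrips",
  value: NAValue::Int(2),
}]);
// and reading the current value back
let val = encoder.query_option_value("nstrips");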

And finally, maybe I should mention that NihAV uses custom reference-counted wrappers for buffers. Standard Rust types do not work well in the case of a frame pool, where the pool always holds a reference to a buffer while the decoder holds another reference to the same frame and wants to modify it. So instead of trying to find which combination of Arc, Cell and RwLock works, I simply NIHed something like Arc but with relaxed checks. Of course I fully realise this is not a proper implementation and it may cause hard-to-debug problems later when (or if) I finally get to multi-threaded decoders, but this is a risk I'm willing to take. Thus most of the structures use Arc for reference-counting purposes, but buffers in packets and frames use NABufferRef instead.

Conclusion

I hope you are not too confused by this overview, and I also hope it was able to demonstrate that NihAV as a framework can handle advanced multimedia features and scenarios. After all, it has decoders for rip-offs of everything from H.263 to H.265 (with RealMedia present in every category, of course) and the only external dependency so far is the Rust standard library. And nihav-encoder can be used to transcode input into some other format using native encoders for audio and video.

If you think that some of the concepts and design decisions presented here are strange, remember that it is exactly the Not-Invented-Here syndrome that gave the project its name and one of its design principles. And if you've found some food for thought here or some interesting ideas worth discussing further, then this overview has accomplished its goal.
