Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Mimi processes 24 kHz audio, down to a 12.5 Hz ...
Abstract: As high dynamic range video is gaining popularity, video coding solutions able to efficiently provide both low and high dynamic range video, notably with a single bitstream, are increasingly ...
Abstract: Currently, balancing low bitrate coding with speech quality is a highly debated topic in the research community. At very low bitrates, existing methods often fail to maintain speech ...