Model news / TelkNet Audio

Four-stem audio separation white paper: why Huge-SCNet fits the MSS target

News date: 2026-06-30
Current four-stem model: Huge-SCNet-4stems V1.2
Standard output: vocals / drums / bass / other
Topic: MSS architecture, Mel-Band RoFormer limits, SCNet tradeoffs

Summary

AI music source separation has moved past the simple question of whether a vocal can be pulled out of a mix. For remixing, sampling, practice, Music-to-MIDI, spatial audio, and material cleanup, the useful question is whether the result remains editable. TelkNet's four-stem tool follows that stricter target: split a full mix into vocals, drums, bass, and other, rather than only producing vocal and backing-track files.

This article organizes the recent discussion around MVSep Ensemble, SCNet XL IHF, Mel-Band RoFormer, BS-RoFormer, and TelkNet's current four-stem tool into one technical story. A model that is strong for vocals is not automatically best for four stems. A platform ensemble can score highly without being a downloadable standalone model. TelkNet currently uses Huge-SCNet-4stems V1.2 because standard four-stem work needs a balanced treatment of vocals, drums, bass, and the remaining instruments.

MVSep public algorithm pages and leaderboards are common reference points in music source separation. This article treats them as evidence sources, not as proof that an ensemble workflow is one standalone model.

Four stems are not another name for backing tracks

Standard four-stem separation usually means vocals, drums, bass, and other. Vocals are the singing track, drums are the drum kit, bass is the bass part, and other holds the remaining instruments: guitar, piano, strings, synths, brass, effects, and more. The backing track heard after muting vocals is really drums, bass, and other summed together.
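The relationship between the four stems and a backing track can be sketched in a few lines. This is a minimal illustration only: the stem arrays are random placeholder signals, and the helper name `backing_track` is hypothetical rather than part of any TelkNet tool.

```python
import numpy as np

def backing_track(stems):
    """Sum every stem except vocals into one instrumental waveform."""
    return sum(audio for name, audio in stems.items() if name != "vocals")

# Hypothetical one-second mono stems at 44.1 kHz, filled with noise for illustration.
rng = np.random.default_rng(0)
stems = {name: 0.1 * rng.standard_normal(44100)
         for name in ("vocals", "drums", "bass", "other")}

instrumental = backing_track(stems)
mixture = sum(stems.values())

# Muting vocals in the mix gives the same signal as summing drums, bass, and other.
print(np.allclose(mixture - stems["vocals"], instrumental))
```

With ideal stems the two routes are identical; with separated stems, any leakage or artifact in drums, bass, or other is carried into the summed backing track.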

This target comes from the main datasets and evaluation tradition in music source separation. MUSDB18 and MUSDB18-HQ have long shaped model training and evaluation around these four stems. When the community says 4-stem, it usually means this output target.

The distinction changes the tool choice. If the goal is a clean vocal or a no-vocal instrumental, a two-stem vocal separator is more direct. If the goal is drum practice, bass analysis, sampling, remixing, or editable material cleanup, four stems are the more useful target. Other is not a weaker backing track; it is the hardest instrument basket in the system.

Four stages from U-Net to SCNet

Music source separation is the attempt to recover several sources from one mixed waveform across time and frequency. Earlier models could produce listenable outputs, but the artifacts were familiar: vocal leakage, drum bleed, muddy bass, smeared high frequencies, or a phasey, torn sound in dense arrangements.

Period	Representative path	Progress	Remaining issue
2020-2021	VR Architecture / U-Net	Treated spectrograms like images and made basic vocal/accompaniment separation practical.	Local convolution windows struggled with long musical context and phase relationships.
2022-2023	HTDemucs (Meta AI) / MDX-Net	Combined waveform and spectrogram modeling, improving reconstruction quality and stability.	Dense climaxes, heavy reverb, and complex transients could still produce artifacts.
2023-2024	BS-RoFormer / Mel-Band RoFormer (SAMI, ByteDance)	Used band splitting, RoPE, and stronger time-frequency modeling for vocals, harmonies, slides, and sustain.	A frequency allocation that favors vocals is not automatically best for all four-stem instruments.
2025-2026	SCNet / Huge-SCNet (Tsinghua SIGS, Skywork AI, Peng Cheng Lab, CUHK, and others)	Used subband modeling and sparse compression to focus on information-rich time-frequency regions.	The target remains standard four-stem separation, not six-stem output or a platform ensemble.

This history explains why model selection cannot be reduced to vocal quality. Four-stem separation must also preserve drum transients, bass fundamentals and harmonics, vocal boundaries, and the high-frequency details inside other.

Where Mel-Band RoFormer is strong

Mel-Band RoFormer was proposed by ByteDance's Speech, Audio, and Music Intelligence (SAMI) team. The paper authors are Ju-Chiang Wang, Wei-Tsung Lu, and Minz Won, with the title page affiliation listed as SAMI, ByteDance. The model is attractive because it shapes frequency bands closer to human hearing. The mel scale spaces bands more finely at low and mid frequencies, where the ear resolves pitch most precisely and where vocal fundamentals, consonants, breath, and many pitch cues live. Mel-band projection gives this range a finer representation, helping the model follow leads, backing vocals, slides, vibrato, and reverb tails.
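How a mel-spaced layout distributes resolution can be sketched with the common HTK-style mel formula. This is a simplified illustration under stated assumptions: the band count of 60, the 22.05 kHz upper limit, and the non-overlapping edges are illustrative choices, not the band definition used in the Mel-Band RoFormer paper.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min=0.0, f_max=22050.0):
    """Band edges in Hz that are equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 1)
    return mel_to_hz(mels)

edges = mel_band_edges(60)      # 60 bands is an illustrative assumption
widths = np.diff(edges)         # width of each band in Hz

# Share of bands that start at or above 6 kHz versus the share of the linear
# frequency range that this region covers.
share_of_bands = np.mean(edges[:-1] >= 6000.0)
share_of_range = (edges[-1] - 6000.0) / edges[-1]
print(f"narrowest band: {widths[0]:.1f} Hz, widest band: {widths[-1]:.1f} Hz")
print(f"bands above 6 kHz: {share_of_bands:.0%} of bands, {share_of_range:.0%} of range")
```

The widths grow steadily with frequency, so the region above 6 kHz is described by a minority of the bands even though it spans most of the linear range.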

The RoFormer family also uses rotary position encoding, or RoPE, which helps describe relative positions across time and frequency. A slide is not a single point, and a harmony is not a static image. They are moving frequency traces. RoPE and axial attention help preserve that continuity across longer passages.
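The relative-position property of RoPE can be checked numerically. The sketch below is a generic NumPy version of rotary encoding on a single vector, not code from BS-RoFormer or Mel-Band RoFormer; the function name `rope` and the feature size of 8 are illustrative assumptions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to position."""
    d = x.shape[-1]                              # feature size, must be even
    freqs = base ** (-np.arange(0, d, 2) / d)    # one rotation rate per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Both pairs of positions are 4 steps apart, so the attention scores match.
score_near = rope(q, 3) @ rope(k, 7)
score_far = rope(q, 103) @ rope(k, 107)
print(np.isclose(score_near, score_far))
```

Because the score depends only on the offset between positions, a slide or vibrato pattern is scored the same way wherever it occurs in the passage.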

That is why Mel-Band RoFormer is a strong route for vocal/accompaniment, lead/backing-vocal, and clean vocal extraction work. The problem is that four-stem separation is not only about vocals. It also has to separate drums, bass, and other, and other often contains the most difficult high-frequency instrumental detail.

Why vocal strength does not equal four-stem strength

The short answer is task fit. If the target is vocal/accompaniment separation, Mel-Band RoFormer remains a very strong vocal-oriented route. If the target is a clean, editable standard four-stem output, TelkNet's current Huge-SCNet-4stems V1.2 choice is the better fit for the product goal.

The numbers should not be mixed across leaderboards. The Mel-Band RoFormer paper reports Mel-RoFormer improvements over BS-RoFormer on MUSDB18-HQ, while the public MVSep Mel Band Roformer page is a vocal/instrumental algorithm page and notes that the original competition model was not directly released. For that reason, this article does not treat a vocal-only number as a directly comparable four-stem row.

Public four-stem comparison	Vocals	Drums	Bass	Other	How to read it
Huge-SCNet-4stems V1.2	9.6073 dB	11.7422 dB	12.0639 dB	6.6485 dB	Current TelkNet four-stem model; the drums, bass, and other balance fits editable-stem use.
BS Roformer 4-stem	9.19 dB	11.29 dB	11.08 dB	5.96 dB	A public four-stem RoFormer-family comparison row; these numbers should not be relabeled as Mel-Band RoFormer.

Mel-band splitting gives more modeling resolution to the low and mid range, which helps vocals. But in a four-stem task, the 6 kHz to 20 kHz region also matters. Cymbal air, guitar strum harmonics, piano attacks and tails, brass brightness, and synth sheen often live there.

If the high-frequency region is grouped too coarsely, the model can blur these details together. The failure may not sound like a bad vocal. It may sound like a muddy other stem, smeared cymbals, or unclear boundaries between guitar and piano. For a listener who only wants an a cappella, that may not be the main defect. For someone who needs editable stems, it matters immediately.

This is the common confusion in model discussions. Mel-Band RoFormer can be a strong vocal model and can perform well in some four-stem settings, but that does not automatically make it the best answer for standard four-stem separation. The hard part is separating several source types at the same time, not only removing the most obvious vocal layer.

Why SCNet fits this four-stem target

SCNet takes a more information-oriented view of the spectrum. The paper authors include Weinan Tong, Jiaxu Zhu, Jun Chen, Shiyin Kang, Tao Jiang, Yang Li, Zhiyong Wu, and Helen Meng; its title page lists Shenzhen International Graduate School, Tsinghua University, Skywork AI PTE. LTD., Peng Cheng Lab, and The Chinese University of Hong Kong among the affiliations. Architecturally, SCNet explicitly splits the mixture spectrogram into subbands and uses sparse compression to handle the different information density of those bands. In simpler terms, it does not put nearly all capacity into the vocal range. It pushes the model to focus on time-frequency regions that actually carry useful instrument signal.

That matters for four stems. Bass sits low but carries important harmonics. Drums combine very short transients with cymbals and room sound. Other may contain guitar, piano, strings, synths, and effects at the same time. Subband modeling gives different frequency regions more targeted treatment, while sparse compression keeps model capacity away from blank or low-information regions.
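The idea of treating subbands differently can be sketched as follows. This is a rough illustration, not the SCNet implementation: the split ratios, the strides, and the use of average pooling in place of learned convolutions are all assumptions made for the example.

```python
import numpy as np

def split_and_compress(spec, ratios=(0.2, 0.4, 0.4), strides=(1, 4, 16)):
    """Split frequency bins into subbands and average-pool each with its own stride.

    spec has shape (n_freq, n_frames). ratios and strides are illustrative
    assumptions, not the published SCNet settings.
    """
    n_freq = spec.shape[0]
    bounds = np.cumsum([0] + [int(round(r * n_freq)) for r in ratios])
    bounds[-1] = n_freq                      # make sure every bin is assigned
    compressed = []
    for lo, hi, stride in zip(bounds[:-1], bounds[1:], strides):
        band = spec[lo:hi]
        usable = (band.shape[0] // stride) * stride   # drop leftover bins
        pooled = band[:usable].reshape(-1, stride, band.shape[1]).mean(axis=1)
        compressed.append(pooled)
    return compressed

# Hypothetical magnitude spectrogram: 2048 frequency bins, 100 frames.
spec = np.abs(np.random.default_rng(0).standard_normal((2048, 100)))
bands = split_and_compress(spec)
print([b.shape for b in bands])
```

The low band keeps its full resolution while the higher bands are compressed more strongly, which is one simple way to spend capacity according to information density rather than evenly across the spectrum.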

The SCNet paper describes the full system as an audio encoder, a separation network based on dual-path RNN, and an audio decoder. The encoder maps the mixture into frequency-domain subband representations, the separation network models structure across time and frequency, and the decoder reconstructs the target stems.

Huge-SCNet-4stems V1.2 belongs to this route. TelkNet uses it for the four-stem tool not because it beats every model on every stem, but because its balance matches the product target: users need vocals, drums, bass, and other as four files they can continue to edit.

MVSep Ensemble is not one model

MVSep Ensemble 2025.06.30 represents a platform-level multi-model workflow. The public algorithm page lists more than one model family. The vocal path may use UVR-MDX-NET, Demucs, MDX23C, VitLarge23, BS Roformer, Mel Roformer, SCNet XL, and other routes, while bass, drums, and other can call different Demucs or related models.

The strength of this kind of ensemble comes from specialization: assign vocal-heavy material to vocal specialists, low-frequency and transient-heavy material to stronger instrumental separators, then fuse the results. It can be a quality-ceiling reference, but it is not one checkpoint and not a standalone model a user can reproduce by downloading a single file.
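The fusion step can be pictured as a weighted average of several models' estimates for the same stem. The sketch below is an assumption for illustration only: the source does not say how MVSep combines its models, and the function name `fuse_stem`, the weights, and the placeholder waveforms are hypothetical.

```python
import numpy as np

def fuse_stem(estimates, weights):
    """Weighted average of several models' estimates for the same stem."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(estimates)                 # shape: (n_models, n_samples)
    return np.tensordot(weights / weights.sum(), stacked, axes=1)

# Hypothetical vocal estimates from two different separators.
rng = np.random.default_rng(0)
vocals_a = rng.standard_normal(44100)
vocals_b = rng.standard_normal(44100)

# Trust the first model three times as much as the second for this stem.
fused_vocals = fuse_stem([vocals_a, vocals_b], weights=[3, 1])
print(fused_vocals.shape)
```

Even in this simplified form, the result depends on several checkpoints plus a set of weights, which is why an ensemble cannot be reproduced by downloading a single file.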

For that reason, discussion of a downloadable single four-stem model, or of TelkNet's current four-stem model, must keep platform ensembles outside the single-model category. Otherwise the comparison shifts from model architecture to hosted workflow.

Public four-stem comparison

The table below uses public figures already present in TelkNet's tool reference data. The scores help locate each model under the same four-stem target. They should not be read as an absolute verdict for every song, genre, or listening case.

Model	Average SDR (dB)	Vocals (dB)	Drums (dB)	Bass (dB)	Other (dB)	How to read it
Huge-SCNet-4stems V1.2	10.02	9.6073	11.7422	12.0639	6.6485	Current TelkNet four-stem model in the public reference table.
SCNet XL IHF 4-stem	9.92	9.68	11.58	11.94	6.48	Strong standalone four-stem candidate, especially for drums and bass.
BS Roformer 4-stem	9.38	9.19	11.29	11.08	5.96	Useful four-stem reference for the band-split RoFormer route.
HTDemucs4	9.16	8.24	10.88	11.76	5.74	Mature classic baseline, but not the ceiling for current four-stem quality.
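The Average SDR column appears to be the unweighted mean of the four per-stem figures, which can be checked directly from the rows above. The sketch below assumes that reading; the dictionary simply restates the table values.

```python
# Per-stem SDR in dB from the table above: vocals, drums, bass, other.
scores = {
    "Huge-SCNet-4stems V1.2": (9.6073, 11.7422, 12.0639, 6.6485),
    "SCNet XL IHF 4-stem": (9.68, 11.58, 11.94, 6.48),
    "BS Roformer 4-stem": (9.19, 11.29, 11.08, 5.96),
    "HTDemucs4": (8.24, 10.88, 11.76, 5.74),
}

def average_sdr(stem_scores):
    """Unweighted mean of the per-stem SDR values."""
    return sum(stem_scores) / len(stem_scores)

for model, stem_scores in scores.items():
    print(f"{model}: {average_sdr(stem_scores):.2f} dB")
```

Because the mean weights every stem equally, a small gain on other counts as much as the same gain on vocals, which is one reason the per-stem columns are worth reading alongside the average.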

How to read the scores

Higher SDR usually means less distortion against the reference stem. But average SDR cannot replace listening, and it cannot flatten every stem into one answer. One model may be strong on bass and drums while another keeps vocals cleaner. Other is especially difficult because it contains many instruments.
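For orientation, the basic signal-to-distortion ratio compares the energy of the reference stem with the energy of the error. The sketch below uses that plain definition; public leaderboards may compute SDR with additional steps (per-chunk scoring, medians, or a different toolkit), so this should not be read as the exact procedure behind the table.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-8):
    """Basic SDR in dB: reference energy over error energy."""
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))

# Hypothetical reference stem: one second of a 220 Hz tone at 44.1 kHz.
t = np.arange(44100) / 44100.0
reference = np.sin(2.0 * np.pi * 220.0 * t)

# An estimate that is 10% too quiet has an error of 0.1 times the reference.
quiet_estimate = 0.9 * reference
print(f"{sdr(reference, quiet_estimate):.1f} dB")
```

The scale is logarithmic, so each 10 dB step corresponds to a tenfold reduction in error energy relative to the reference.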

The value of Huge-SCNet-4stems V1.2 for TelkNet is its clear four-stem product boundary: one public model name, standard four-stem output, and traceable public reference figures. SCNet XL IHF remains an important comparison. MVSep Ensemble remains a quality-ceiling reference. They answer different questions.

Users should start with the target. Use vocal separation for the cleanest vocal. Use four-stem separation for editable vocals, drums, bass, and other. Use six-stem separation when guitar and piano need their own files.

Four stems and six stems: separate technical promises

Six-stem separation splits other further into guitar, piano, and a remaining other stem. It is more sensitive to guitar strums, piano attacks, keyboard texture, and high-frequency harmonics. TelkNet's six-stem tool uses BS Roformer SW 6-stem, which is a different model and a different product promise from Huge-SCNet-4stems V1.2.

That is why public copy should not collapse several claims into one sentence: RoFormer is strong, Mel-Band is strong for vocals, SCNet is balanced for four stems, and MVSep ensembles can score highly. Each statement answers a different question about architecture, output target, public availability, workflow form, or product contract.

Conclusion: choose the architecture, not the buzzword

Four-stem separation is not a renamed backing track and not a place to paste the most popular vocal model onto every instrument. It asks the model to balance vocals, drums, bass, and other, especially drum transients, bass lows, instrumental high harmonics, and the mixed instrument basket inside other.

TelkNet's current choice of Huge-SCNet-4stems V1.2 is a tradeoff around that standard four-stem target. Mel-Band RoFormer remains important for understanding vocal separation, and MVSep Ensemble remains useful as a quality reference. In TelkNet's four-stem tool, the user-facing promise is four editable stems with clear boundaries.
