This article is 3600 words in total, and it takes about 5 minutes to read

Xima Practice: Audio Editing in the Era of Large Models - Text Editing in Cloud Editing

Maslow's hierarchy of needs for creators

By our rough, incomplete count, creators' needs for audio editing tools fall into roughly the following levels, in the spirit of Maslow's hierarchy:

1. Basic editing requirements:

Recording, local upload, waveform drawing, multi-track, audio cutting, audio drag and drop, playback audition, composite export

These are must-have capabilities for an audio editing tool: they let creators produce anything from a single podcast episode to an entire audiobook series. Because they are must-haves, most editing tools on the market already cover them.

2. Efficiency improvement needs:

Audio marking, audio multi-selection, reverberation, audio tuning, cloud collaboration, and more

Creators who do frequent, intensive editing will be familiar with these needs. Marking, for example, lets creators flag highlight moments and quickly locate them among dozens of audio clips, while multi-selection lets them quickly align tracks during editing. Different tools have their own strengths in how they implement these features. Marking was first launched and popularized by a well-known desktop editing tool and then borrowed by everyone else, while cloud sync and collaboration come naturally to online editing tools.

3. The needs of the era of large models:

TTS, AI soundtrack, AI packaging, text editing, one-click production

This is a stage full of imagination, where a variety of algorithmic models open up endless possibilities for audio editing tools: TTS converts text into speech, NLP and text-to-image models generate titles, outlines, and cover art for audio, and ASR makes it possible to cut audio by editing visible text.

4. Ultimate Needs:

"I don't have audio material, but I want to generate a podcast"

This is what I need when the deadline arrives and I haven't even started yet...

Cloud Editing

As the main audio editing tool in the Xima creator ecosystem, Cloud Editing has, since its inception, taken "creating a one-stop audio editing solution for creators" as its mission, and it keeps going further down the road of audio editing and creation tools for hosts.

Since we are in the large-model era and will probably remain in it for a long time, we hope Cloud Editing's series of intelligent solutions will earn everyone's recognition. We also hope readers take away a few ideas from how the features are used and implemented, and pick up some useless knowledge along the way.

In this issue, we start with Cloud Editing's text editing feature.

Text editing application scenarios

Audio editing has many pain points. One of them is that, unlike text or video, sound is invisible: a waveform cannot be previewed quickly and effectively the way text or video frames can. Cutting audio is a loop of editing, auditioning, and re-editing, and as long as the audio is long enough or the recording quality poor enough, that "edit, audition, re-edit" loop never ends.

Is there a way to make audio editing "visual"?

• You can click to jump straight to a sentence or a word in the audio, instead of repeatedly seeking and auditioning by feel

• You can search for a passage in the audio by its text, instead of relying on auditioning

• You can clear filler words and breath sounds with one click, instead of auditioning to find them and then cutting them out

• You can delete a mistake directly, instead of cutting at the start of the mistake, cutting again at its end, deleting the mistaken clip, and finally moving the displaced audio forward

The answer is yes, and that is text editing.

Using the feature

Text editing covers word-level editing, search, quick selection, filler-word detection and removal, breath detection and removal, marking, and more. It is fairly simple to use, so I won't go into detail here; if you are interested, you can try it in Cloud Editing. We will skip this part.

Functional implementation

Model capabilities

The prerequisite for text editing is ASR, which converts human speech into the corresponding text. The data structures that text editing works with are generated from the ASR output.
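As a rough sketch of what such data could look like, assume each recognized word carries its start and end time in the audio. The type names, field names, and the millisecond unit below are hypothetical and chosen for illustration; they are not the actual Cloud Editing schema.

```typescript
// Hypothetical shape of word-level ASR output; all names are illustrative.
interface AsrWord {
  text: string;    // recognized word or character
  startMs: number; // assumed: start time within the audio, in milliseconds
  endMs: number;   // assumed: end time within the audio, in milliseconds
}

interface AsrSentence {
  words: AsrWord[];
}

// Join the words of a sentence back into its display text.
function sentenceText(sentence: AsrSentence): string {
  return sentence.words.map((w) => w.text).join("");
}

// Time span covered by a sentence: first word's start to last word's end.
function sentenceSpan(
  sentence: AsrSentence
): { startMs: number; endMs: number } | null {
  if (sentence.words.length === 0) return null;
  const first = sentence.words[0];
  const last = sentence.words[sentence.words.length - 1];
  return { startMs: first.startMs, endMs: last.endMs };
}

// Made-up sample data for illustration only.
const sample: AsrSentence = {
  words: [
    { text: "你", startMs: 0, endMs: 180 },
    { text: "好", startMs: 180, endMs: 420 },
  ],
};
```

With per-word times available, any piece of text can be mapped back to a time range in the audio, which is what the rest of the design relies on.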

Engineering

With the ASR foundation in place, the next step is engineering. Text editing essentially adds a parallel text module on top of traditional audio editing and links the audio and text modules together. So we first need to implement a text module, which is responsible for the following:

• Render the text of the audio: makes the audio content visible at a glance while editing

• Click-to-locate and highlight on text: makes it quick to position the audio playback cursor

• Search highlighting on text: makes it quick to find and jump to content

• Drag selection on text with a pop-up delete box: makes it possible to delete the audio corresponding to the text with one click

• Sentence highlighting: lets users see in real time where playback currently is and what is being said

• Quick selection of paragraphs: makes it quick to delete a continuous stretch of audio

• Filler-word recognition: marks filler words throughout the text based on a user-supplied list, so they can be deleted later

• Breath recognition: inserts a breath marker at the corresponding position based on the time of each detected breath, so it can be deleted later

• Marker recognition: recognizes markers that have been placed and inserts them at the corresponding positions in the text

It looks complicated and, in fact, it is not simple at all.

For example, when you click a word in a large block of text, the word should be highlighted and the playback cursor should move to the position in the audio that corresponds to that word. Your first thought is to wrap the word in an element and highlight it with CSS. As for the word's time, you could use the Selection API to get the word's index in the text, then use that index to look up its time in the ASR data.

But as soon as you start, you find that wrapping words in extra elements and dragging across paragraphs both interfere with getting the index from the Selection API, and because text is constantly being deleted during editing, the index cannot be relied on to stay accurate.

Architectural design

After a long 5 minutes of thinking, we realized that the smallest DOM unit, or component unit, should be the word.

Layering

Word level

Each word is abstracted as a Word component that keeps its own start and end times in state and writes them to data attributes on its DOM node. A Word can work out whether it is the located, highlighted word from the current playback cursor time, whether it is inside a drag selection from the selection's time range, and whether it is part of a filler word from a flag in its props.
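The per-word checks described here can be sketched as pure functions over time values. This is a minimal illustration, not the actual Word component: the names, the millisecond unit, and the half-open interval convention are all assumptions.

```typescript
// Hypothetical types; names and units are illustrative assumptions.
interface WordTiming {
  startMs: number;
  endMs: number;
}

interface TimeRange {
  startMs: number;
  endMs: number;
}

// A word is "located" when the playback cursor falls inside [startMs, endMs).
function isAtCursor(word: WordTiming, cursorMs: number): boolean {
  return cursorMs >= word.startMs && cursorMs < word.endMs;
}

// A word is inside the drag selection when its interval overlaps the selection.
function isInSelection(word: WordTiming, selection: TimeRange | null): boolean {
  if (selection === null) return false;
  return word.startMs < selection.endMs && word.endMs > selection.startMs;
}

// Values a Word component could write to data attributes on its DOM node.
function dataAttributes(word: WordTiming): Record<string, string> {
  return {
    "data-start": String(word.startMs),
    "data-end": String(word.endMs),
  };
}
```

Because each word decides its own state from times rather than from a text index, deleting other words does not invalidate it.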

Sentence level

N Word components form a Sentence component. From the current playback cursor time, it works out whether it is the sentence that should be highlighted during playback. It also checks whether the sentence contains a filler word or search term and, if so, tells the corresponding Word components that they are part of one.
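One way a sentence could work out which of its words belong to a filler word or search term is sketched below. It assumes a term may span several word units (for example, when each unit is a single character), so matches are found on the joined sentence text and mapped back to word indices. The function and type names are hypothetical.

```typescript
// Hypothetical word unit; only the text matters for matching.
interface WordUnit {
  text: string;
}

// Returns one flag per word: true if the word is covered by any
// occurrence of any term in the joined sentence text.
function flagWords(words: WordUnit[], terms: string[]): boolean[] {
  const flags = words.map(() => false);

  // Character offset at which each word starts in the joined text.
  const offsets: number[] = [];
  let full = "";
  for (const w of words) {
    offsets.push(full.length);
    full += w.text;
  }

  for (const term of terms) {
    if (term.length === 0) continue; // skip empty terms to avoid looping forever
    let from = full.indexOf(term);
    while (from !== -1) {
      const to = from + term.length;
      words.forEach((w, i) => {
        const wStart = offsets[i];
        const wEnd = wStart + w.text.length;
        // Flag the word when its character range overlaps the match.
        if (wStart < to && wEnd > from) flags[i] = true;
      });
      from = full.indexOf(term, from + 1);
    }
  }
  return flags;
}
```

The resulting flags are what a Sentence would pass down as props so that each Word can style itself.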

Paragraph level

N Sentence components are abstracted into a Paragraph component, which works the same way as above. The only difference is that Paragraph takes on part of the performance optimization work, discussed later.

In this way, the text module becomes a three-level structure of paragraph-sentence-word, on which we add some general capabilities:

For example, drag-selecting text: using the Selection API, take the start time of the first word in the range and the end time of the last word; together they give the text/audio time range of the selection.
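A minimal sketch of that calculation, assuming the two words at the ends of the drag have already been resolved (in a browser, for instance, from the nodes reported by `window.getSelection()` and the times stored on them). The names and the millisecond unit are hypothetical.

```typescript
// Hypothetical types; names and units are illustrative assumptions.
interface TimedWord {
  startMs: number;
  endMs: number;
}

interface TimeRange {
  startMs: number;
  endMs: number;
}

// anchor is the word where the drag started, focus the word where it ended.
// A backward drag puts focus before anchor, so the two are ordered first.
function selectionRange(anchor: TimedWord, focus: TimedWord): TimeRange {
  const [first, last] =
    anchor.startMs <= focus.startMs ? [anchor, focus] : [focus, anchor];
  return { startMs: first.startMs, endMs: last.endMs };
}
```

Only the two end words are needed; the words in between never have to be inspected.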

We also handle some edge cases:

For example, if the selection spans the text of two audio clips, you need to fill in the end time of the previous clip and the start time of the next clip to form two intervals, so that the corresponding part of each clip is deleted.
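The splitting step can be sketched as follows, under the assumption that each word knows which clip it belongs to and that times are local to each clip. All names are hypothetical, and handling of clips that lie entirely between the two ends is an added generalization rather than something stated above.

```typescript
// Hypothetical types; times are assumed to be local to each clip.
interface Clip {
  id: string;
  startMs: number;
  endMs: number;
}

interface WordRef {
  clipId: string;
  startMs: number;
  endMs: number;
}

interface DeleteInterval {
  clipId: string;
  startMs: number;
  endMs: number;
}

// clips must be in timeline order; first/last are the end words of the selection.
function splitSelection(clips: Clip[], first: WordRef, last: WordRef): DeleteInterval[] {
  const from = clips.findIndex((c) => c.id === first.clipId);
  const to = clips.findIndex((c) => c.id === last.clipId);
  if (from === -1 || to === -1 || from > to) return [];

  // Selection inside a single clip: one interval is enough.
  if (from === to) {
    return [{ clipId: first.clipId, startMs: first.startMs, endMs: last.endMs }];
  }

  const result: DeleteInterval[] = [];
  // Previous clip: from the first selected word to the clip's end.
  result.push({ clipId: clips[from].id, startMs: first.startMs, endMs: clips[from].endMs });
  // Clips strictly in between are covered completely.
  for (let i = from + 1; i < to; i++) {
    result.push({ clipId: clips[i].id, startMs: clips[i].startMs, endMs: clips[i].endMs });
  }
  // Next clip: from the clip's start to the last selected word.
  result.push({ clipId: clips[to].id, startMs: clips[to].startMs, endMs: last.endMs });
  return result;
}
```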

As another example, if a drag starts in the text and then leaves the browser window, the end word cannot be detected, so the end word has to be found within the Paragraph, starting from the beginning word, to form a valid closed time range.

Channel

With the text module covered, let's talk about the channel. From the GIF of text editing above, or from actual use, you can see that the audio and text panels are linked, and these links are carried by the channel. The channel is responsible for sharing methods (e.g., delete, locate), passing data (e.g., cursor time, drag range), and doing computation (e.g., automatically working out the text that corresponds to the remaining audio when an audio clip is deleted). A channel contains a set of stores and utils for keeping data in sync.
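To make the idea concrete, here is a minimal sketch of a channel-like building block: a tiny observable store that both panels could subscribe to, plus one util that computes the words remaining after an audio range is deleted. The `Store` class and `remainingWords` function are hypothetical illustrations, not the actual Cloud Editing stores, and dropping words that only partly overlap the deleted range is an assumption.

```typescript
type Listener<T> = (value: T) => void;

// Minimal observable store: holds one value and notifies subscribers on change.
class Store<T> {
  private value: T;
  private listeners = new Set<Listener<T>>();

  constructor(initial: T) {
    this.value = initial;
  }

  get(): T {
    return this.value;
  }

  set(next: T): void {
    this.value = next;
    this.listeners.forEach((listener) => listener(next));
  }

  // Returns a function that removes the listener again.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}

interface TimedWord {
  text: string;
  startMs: number;
  endMs: number;
}

// Words that survive deleting the audio range [fromMs, toMs).
// Assumption: a word that overlaps the range at all is removed.
function remainingWords(words: TimedWord[], fromMs: number, toMs: number): TimedWord[] {
  return words.filter((w) => w.endMs <= fromMs || w.startMs >= toMs);
}

// Usage: the text panel listens to the cursor time set by the audio panel.
const cursorMs = new Store<number>(0);
const seen: number[] = [];
const unsubscribe = cursorMs.subscribe((ms) => seen.push(ms));
cursorMs.set(1200); // e.g. the user clicked the audio timeline
unsubscribe();
cursorMs.set(3000); // no longer observed by the text panel
```

Keeping the shared state in stores means neither panel needs a direct reference to the other; each only reads from and writes to the channel.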

For example, when you click on the audio timeline or on the text, the cursor in the other panel moves to the corresponding point in time.

For example, when you delete audio or text, the corresponding content in the other panel is deleted as well.

For example, when you drag-select text, the corresponding interval on the audio is highlighted as a selection too.

For example, a marker placed on the audio or the text is synchronized to the other panel.

For example, as the playback cursor keeps moving, off-screen text is scrolled into view. Oh, this one has little to do with the channel...

In short, the audio panel, the text panel, and the channel together form the complete text editing module. Let's take a look at the final architecture diagram:

Performance optimization

Performance issues

As you may have noticed, making the word the smallest unit produces a large number of components and DOM nodes, which is a heavy performance load. In fact, early Cloud Editing did not support word-by-word editing and used the sentence as the smallest unit, and performance problems were already showing up with long audio and long text. Once word-level editing launched, the number of components grew dramatically and exposed the problem more directly: clicking text to locate, drag-selecting, and deleting were all accompanied by obvious lag. (The picture shows the early sentence-level editing experience with long audio. The GIF has display problems: only the first few frames are shown, and the complete process cannot be seen.)

Optimization approach

There are two ways to solve the problem:

1. Reduce the number of DOM nodes

The Paragraph component uses an IntersectionObserver to detect whether it is in the viewport. Inside the viewport it enables full functionality; outside, it downgrades its many Sentences and Words to plain text (styled identically to the real components) and highlights filler words and search terms in a low-overhead way, which preserves the counts of search terms and filler words and the ability to jump to them quickly.
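The effect of the downgrade on node count can be illustrated with a small sketch. It assumes the viewport state arrives as a boolean (in a browser this would come from an IntersectionObserver callback) and that an off-screen paragraph is rendered as a few text runs, split only where a search term needs highlighting. The function names and the single-term simplification are hypothetical.

```typescript
interface Segment {
  text: string;
  highlighted: boolean;
}

// Split plain text into highlighted and non-highlighted runs for one term.
function splitByTerm(text: string, term: string): Segment[] {
  if (text.length === 0) return [];
  if (term.length === 0) return [{ text, highlighted: false }];

  const segments: Segment[] = [];
  let pos = 0;
  let hit = text.indexOf(term, pos);
  while (hit !== -1) {
    if (hit > pos) {
      segments.push({ text: text.slice(pos, hit), highlighted: false });
    }
    segments.push({ text: term, highlighted: true });
    pos = hit + term.length;
    hit = text.indexOf(term, pos);
  }
  if (pos < text.length) {
    segments.push({ text: text.slice(pos), highlighted: false });
  }
  return segments;
}

// Nodes a paragraph would render: one per word inside the viewport,
// one per text run outside it.
function nodeCount(words: string[], inViewport: boolean, term: string): number {
  return inViewport ? words.length : splitByTerm(words.join(""), term).length;
}
```

In this simplified model, an off-screen paragraph with no matches collapses to a single text node regardless of how many words it holds, while the number of highlighted runs still gives the match count.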

2. Optimize drag-selection detection

In the early version, to handle drag selection across texts (selecting text that belongs to different audio clips), every smallest unit bound events to detect during the drag whether it was being selected. Each selected unit then updated the start and end time of the drag range in the store in turn, and every unit decided whether to highlight itself based on the drag time range passed down from the store. This caused far too many executions and needless component re-renders. Later we changed it so that the first and last words of the selection are detected only when the mouse button is released, which establishes the selection time range. Cross-text dragging is handled separately, and the selection time is then set in the store to trigger the highlight on the corresponding text.

Optimization results


Taking a 3-hour audio with about 45,000 words as an example, the time taken before and after optimization compares as follows:

Operation	Legacy version	New version
Rendering	2000	600
Drag and drop	3000	400
Delete	2000	500
Text positioning	2000	200
Audio positioning	1000	100
Undo & Redo	1500	100
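To read the table as speedup factors, the short calculation below divides each legacy figure by the new one. The numbers are copied from the table; the source does not state their unit, but the ratio does not depend on it. The variable and function names are illustrative.

```typescript
// Figures copied from the table above: [operation, legacy, new].
const timings: Array<[string, number, number]> = [
  ["Rendering", 2000, 600],
  ["Drag and drop", 3000, 400],
  ["Delete", 2000, 500],
  ["Text positioning", 2000, 200],
  ["Audio positioning", 1000, 100],
  ["Undo & Redo", 1500, 100],
];

// Speedup factor = legacy time / new time, rounded to one decimal place.
function speedup(legacy: number, next: number): number {
  return Math.round((legacy / next) * 10) / 10;
}

const speedups = timings.map(([name, legacy, next]) => ({
  name,
  factor: speedup(legacy, next),
}));

for (const { name, factor } of speedups) {
  console.log(`${name}: ${factor}x faster`);
}
```

This gives roughly 3.3x for rendering, 7.5x for drag and drop, 4x for delete, 10x for text and audio positioning, and 15x for undo and redo.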

Epilogue

Text editing is one of Cloud Editing's attempts to improve the efficiency of audio editing. We believe that, with the joint efforts of our algorithm and engineering partners, the highest level of the creators' hierarchy of needs will eventually be reached.

Author: zhenjiang

Source (WeChat public account): Himalaya technical team

Source: https://mp.weixin.qq.com/s/wWnMwYKBPaEk_5CJsTAR5A
