
Is gesture interaction reliable, after all?


So, have you ever used gesture interaction?

Not the kind of gesture where you snap your fingers and dozens of burly men materialize behind you.

In-car gesture interaction means the vehicle recognizes specific gestures from the driver and occupants to activate various functions, and may eventually even replace the physical and virtual buttons inside the car.

In practice, though, in-car air gestures have become one of the most controversial forms of interaction around today.

Proponents say: "Gesture interaction is so cool! It represents the future!"

Opponents have all sorts of reasons: it is a gimmick, it is inaccurate, it is unsafe... A thousand words boil down to one: unreliable.

In theory, gestures are one of the most intuitive ways humans interact; grasping and holding is the first form of perception we learn in infancy. So why such a big split in opinion? Why are car companies at once eager and cautious about gesture recognition?

So let's take a closer look at whether in-car gesture interaction is reliable.


Gesture interaction: where are its strengths?

The era of the physical button as king has not entirely passed, yet control and interaction keep growing more diverse: voice, touch, gestures, active monitoring... What stays constant is that perception and intelligence are the main prerequisites for all of these interactions.

Just as the mobile phone evolved from the brick phone to the monochrome-screen feature phone, the full-keyboard smartphone, and then the touchscreen smartphone, better hardware performance and network connectivity have brought ever more diverse ways of interacting.

And yes, since getting connected to the internet, the car has entered its own period of rapid evolution.

Some may ask: are the existing forms of interaction not good enough? Why keep adding more?

Stepping out of the cockpit for a wider view, each existing mode of interaction maps onto one of the human senses:

Touch: the tactile sense

Speech: hearing

Gesture: vision

Seen this way, even intelligent voice, the most popular modality today, has its limits. When you are running, diving, parachuting and so on, both "speaking" and "listening" are physically constrained, and the value of touch and vision stands out.


David Rose, an interaction expert at the MIT Media Lab, wrote in "Why Gesture is the Next Big Thing in Design" that, based on his analysis, people choose gestures over speech or touch for four reasons:

Speed – When a quick response is needed, gesturing is faster than speaking.

Distance – When communicating across a distance, a visible gesture is easier than raising your voice.

Conciseness – Gestures work well when you don't need to say much at once, and the more concise the gesture for a given meaning, the easier it is to remember. For example, four fingers folded with the thumb up signals praise and approval; thumb down signals contempt.

Expressiveness over accuracy – Gestures excel at conveying emotion. An orchestra conductor's gestures carry far more than beat and rhythm: sweetness (Italian "dolce"), emphasis (Italian "marcato"), confidence, sadness, longing, and so on.


Spock's classic gesture, an icon of the Star Trek series


In season four of The Big Bang Theory, Sheldon strikes Spock's classic pose, meaning "live long and prosper."

Another benefit of in-car gesture interaction is that it frees the user from the constraints of physical input devices and offers a larger interaction range that tolerates a degree of imprecision. As our most natural communicative instinct, in-car gestures can greatly reduce the attention and resources demanded of the visual channel.

Until fully autonomous driving arrives, sensible use of gesture interaction can effectively reduce driver and passenger distraction and, more valuably, form an important complement to touch, voice, and other interaction methods.

Let's take an intuitive example.

GeekCar's Smart Cockpit Intelligence Bureau featured the new Mercedes-Benz S-Class in November 2021. Its MBUX Intelligent Sensing Assistant captures the driver's hand movements for assistive interaction. Supported gestures include, but are not limited to:

Reaching a hand toward the area below the rearview mirror turns the front reading lights on or off;

Waving a hand forward or backward near the rearview mirror opens or closes the sunroof shade.


In a paper presented at AutomotiveUI 2019, "Effects of Gesture-based Interfaces on Safety in Automotive Applications," researchers studied how gesture interaction for non-driving tasks such as navigation, climate, and entertainment affects driving safety.

In an experiment with 25 participants, the researchers analyzed driving data alongside eye-tracker data. The results suggest that drivers using gestures may respond better to unexpected situations, while no direct evidence showed significant differences between dashboard and gesture interaction in driving performance such as speed, speed variance, or lane position.


It should be stressed that no matter how many advantages an interaction method has, it cannot be separated from its usage scenario.

We cannot guarantee that the car will always be a one-person private space, nor that the atmosphere inside will always suit voice interaction. A simple but practical example: when the toddler at home has finally fallen asleep sprawled across the crib, I would rather open the smart-home app on my phone to control the appliances than risk waking the little one by talking to a smart speaker.

As you can see, the evolution of interaction is a particularly interesting process, as is the change in people's attitudes towards interaction.

A little story. About a decade ago I met an American mobile phone engineer through work, and since I happened to be shopping for a new phone, we got onto the topic. I remember how he praised his phone's full keyboard, scoffed at the iPhone's touchscreen design, and listed all the discomforts of a phone without physical keys.

Interestingly, after we said our goodbyes, he walked some distance, then turned around and shouted: if you really can't decide, the iPhone may be a good choice.

As for what happened next, it goes without saying; we are all witnesses today. Product development and mass acceptance is a long process, and so is the exploration of interaction methods.

In fact, it has been less than ten years since gesture interaction officially entered the car. In that time, car companies and suppliers have brought air gestures into the cockpit amid persistent criticism that they are all flash and no substance, yet the pace of bringing the technology to market has not slowed.

Amid the doubts, in-car gesture development never stopped

In 2013, tech outlet Engadget reported that Google had filed a patent on using hand movements to control cars more efficiently. The patent relies on a depth camera and laser scanner mounted at the top of the cockpit to trigger vehicle functions based on the position and movement of an occupant's hand: swipe near the window and it rolls down; point at the radio and the volume goes up.


Car companies were not idle either. At CES 2014, Kia unveiled a concept car called the KND-7, equipped with a gesture-recognition interaction system.


At the 2014 Beijing Auto Show, JAC exhibited the SC-9 concept car with a human-machine interaction system called PHONEBOOK, developed on Windows. A large sensing zone sits near the bottom of the central control screen: it recognizes various gestures for operating the infotainment system and even supports writing in the air, though only English input was supported at launch.


BMW's gesture control system debuted on the G11/G12 7 Series launched in 2015, the first time air gestures appeared in a production model; the supplier was Delphi of the United States. The user simply draws preset gestures in the air, and the 3D sensing area above the center console quickly detects and recognizes the movements, conveniently controlling functions such as volume or navigation.


For example, pointing the index finger forward and rotating it clockwise raises the volume, while counterclockwise lowers it; a sideways V-sign toward the screen turns the display on or off; waving a palm in front of the screen rejects or dismisses a prompt; and "tapping" the air with a finger answers a call or confirms a prompt.


Looking at China, domestic brands have given their own answers to the same form of interaction.

The Regal SEEK 5, launched in 2018, offers nine air gestures, recognized by a dedicated camera below the central control screen.


When a call comes in, making a "phone handset" gesture toward the screen answers it, and the opposite motion hangs up.


Seeing this, I recalled a story an interaction designer once told in an article: a young kindergarten teacher asked the children to mime making a phone call. The children all pressed their flat palms to their ears, the way one holds a smartphone, while the teacher alone held up the thumb-and-pinky "six" gesture shaped like a handset. Generational and cultural differences shape how the same gesture is understood.

Four fingers folded with the thumb extended to the left or right skips tracks.


A palm-up "come here" motion raises the volume, and a palm-down "sit down" motion lowers it.


A sideways V-sign plays or pauses music, and opening a clenched fist into a spread palm makes a blooming rose appear on the screen; the sense of ceremony is almost sickeningly sweet.


Great Wall's WEY Mocha offers a gesture summon function: the owner can direct the car from outside with nothing but their hands.

Anyone who sees the scene will probably recall being waved into a parking spot by a lot attendant, except the original two humans have been replaced here by one person and one car.


On the Ford EVOS, launched in 2021, the 1.1-meter screen that can be used split or as a single panel is impressive. To help users make better use of it, the EVOS team also designed a series of interactive gestures:

Put an index finger to your lips in a "shush" gesture and the music automatically pauses;

Make an "OK" gesture and the music resumes;

Make a V-sign to switch between split screen and full screen;

Make a five-finger grabbing motion to go straight back to the home page.


Southern fist, northern leg: the technical schools behind air gestures

As mentioned earlier, perception and intelligence are the prerequisites for any interaction. Mainstream gesture interaction falls into two known technical schools:

The radar school:

This school monitors hand movements with miniature millimeter-wave radar to achieve gesture recognition.

Google's Project Soli, announced in 2015, is a sensing technology that tracks air gestures with miniature radar. Specially designed radar sensors follow high-speed movements with millimeter accuracy, and the radar signals are then processed and recognized as a set of universal interactive gestures.


Continued development has shrunk the Soli radar to millimeter scale, small enough to slip into phones and wearables.


Project Soli's best-known production example is the Pixel 4, released by Google in 2019, which uses the Soli radar for a feature called Motion Sense. Users can perform a range of controls with air gestures, without touching the screen: skipping tracks, silencing the phone, quieting alarms, and so on. The Pixel 4's face unlock also leans on the millimeter-wave radar and, having no dependence on ambient light, works even in the dark.


The vision school:

This school uses computer vision to identify hand feature points, and it is more widely adopted than the radar approach.

Although the school represented by the Soli radar boasts strong directionality and resistance to environmental interference, that has not stopped car companies and suppliers from favoring gesture control through computer vision.

Many people probably remember the Kinect motion-sensing peripheral for Microsoft's Xbox consoles. Kinect's depth-sensing technology automatically captures a depth image of the human body and tracks the skeleton in real time, detecting subtle changes in movement.


Gesture recognition can be roughly divided into three levels, from simple to deep: two-dimensional hand recognition, two-dimensional gesture recognition, and three-dimensional gesture recognition. If all we need are the most basic controls such as play/pause, a two-dimensional hand shape or gesture captured by a single camera is enough. Think of streaming video on a smart TV in the living room: when we have to step away for a moment and don't want to miss anything, a simple gesture can pause the playback, roughly as in the sketch below.
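To make the 2D case concrete, here is a minimal sketch of single-camera, two-dimensional hand recognition using the open-source MediaPipe Hands library and a webcam. The open-palm rule, the thresholds, and the play/pause toggle are illustrative assumptions, not any vendor's actual implementation.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def is_open_palm(hand_landmarks) -> bool:
    """Crude 2D check: a finger counts as extended if its tip is farther
    from the wrist than its middle (PIP) joint, in image coordinates only."""
    lm = hand_landmarks.landmark
    wrist = lm[0]

    def dist(a, b):
        return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5

    # (tip, PIP) landmark indices for index, middle, ring, pinky fingers
    fingers = [(8, 6), (12, 10), (16, 14), (20, 18)]
    return all(dist(lm[tip], wrist) > dist(lm[pip], wrist) for tip, pip in fingers)

cap = cv2.VideoCapture(0)
playing = True
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks and is_open_palm(result.multi_hand_landmarks[0]):
            playing = not playing  # toggle play/pause; a real system would debounce this
            print("pause" if not playing else "play")
cap.release()
```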

The spatial environment of a car, however, is not as simple as a sofa and a living room, so three-dimensional gesture recognition with depth information becomes necessary, and the complexity of the camera hardware rises accordingly.

The depth-sensing technologies behind the two generations of Kinect happen to embody two mainstream technical paths for air interaction: structured light and time of flight. Together with multi-angle imaging (multi-camera), they make up the three main vision-based schools of gesture interaction.

Structured Light

Representative application: the original Kinect for Xbox 360, with sensing technology from PrimeSense

Principle: A laser projector shines through a specific grating, which deflects the beam so that where the pattern lands on an object's surface shifts with the object's shape and distance. A camera captures the pattern projected onto the surface, and from the displacement of the pattern an algorithm calculates the object's position and depth, reconstructs the three-dimensional scene, and matches gestures against known patterns.

On the first-generation Kinect for Xbox 360, the best recognition is only achieved within a range of roughly 1 to 4 meters. Because the technique depends on the displacement of the projected pattern, it works neither too close nor too far, and it copes poorly with reflective surfaces; on the plus side, the technology is relatively mature and its power consumption comparatively low.
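The depth arithmetic itself boils down to triangulation between the projector and the camera. Below is a minimal sketch; the focal length, baseline, and dot shift are made-up illustrative numbers, not the Kinect's actual calibration.

```python
def depth_from_dot_shift(focal_px: float, baseline_m: float, shift_px: float) -> float:
    """Simplified structured-light triangulation: a projected dot appears
    shift_px pixels away from its reference position, and depth falls off
    as the inverse of that shift."""
    if shift_px <= 0:
        raise ValueError("dot shift must be positive")
    return focal_px * baseline_m / shift_px

# e.g. 580 px focal length, 7.5 cm projector-camera baseline, 20 px shift
print(depth_from_dot_shift(580, 0.075, 20))  # roughly 2.2 m
```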


Time of Flight

Representative applications: Intel's perceptual computing technology from vendor SoftKinetic (later acquired by Sony), and the Kinect 2.0 for Xbox One

Principle: As the name suggests, this is also the simplest of the three paths. An emitter continuously sends an optical signal toward the target; a dedicated CMOS sensor receives the light reflected back, and the round-trip flight time of the signal yields the distance to the target. Unlike structured light, the device emits an area light source rather than scattered dots, so its theoretical working range is longer.

Put simply, ToF works like a bat's echolocation, except that it emits light rather than ultrasound. With comparatively strong interference resistance and recognition range, ToF is regarded as one of the most promising gesture recognition technologies.
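The core distance relation is just the speed of light times half the round-trip time. The sketch below shows that arithmetic; real ToF sensors typically measure the phase shift of modulated light rather than timing single pulses, and the 10 ns figure is only an example.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance from the round-trip flight time of the emitted light:
    the signal travels out and back, so halve the path."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2

print(tof_distance(10e-9))  # a 10 ns round trip corresponds to roughly 1.5 m
```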

Incidentally, thanks to the recent drip-feed teaser campaign for the Li Auto L9, 3D ToF technology has enjoyed another wave of attention.


Multi-Camera

Representative applications: the Fingo gesture interaction module from uSens (Linggan Technology), and Leap Motion's motion controller of the same name

Principle: Two (or more) cameras photograph the same environment, producing two (or more) images from different viewpoints, and depth is calculated from their geometry. Because the cameras' parameters and relative positions are known, once the same object is located in the different images, an algorithm can compute its position and depth.

Put simply, a binocular camera is like a pair of human eyes, while a multi-camera array is like an insect's compound eye; algorithms fuse the views into multi-angle three-dimensional imaging.

Multi-angle imaging is the most polarized of the three. On one hand it has the lowest hardware requirements; on the other, because it relies entirely on computer vision, it places very high demands on the algorithms that compute depth and correct distortion. Compared with structured light and ToF, its actual power consumption is much lower and it holds up well against strong light, making it an inexpensive path to gesture recognition.
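As a rough illustration of how two views yield depth, here is a minimal sketch using OpenCV's stereo block matcher. The image paths, camera intrinsics, and baseline are placeholder assumptions; a production system would add calibration, rectification, and filtering on top.

```python
import cv2
import numpy as np

# rectified left/right grayscale frames (file names are placeholders)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# classic block matching; numDisparities must be a multiple of 16
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

# depth = focal_length(px) * baseline(m) / disparity(px); illustrative intrinsics
focal_px, baseline_m = 700.0, 0.06
valid = disparity > 0
depth_m = np.full_like(disparity, np.nan)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
```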


So, is gesture interaction reliable?

Let's return to the question in the title: is gesture interaction reliable?

My answer is yes. Both now and in a fully autonomous future, in-car gesture interaction has great potential, but it is still early days.

Car companies and suppliers can hardly squeeze any more tricks out of physical buttons, and for all the fuss over the shape, size, and materials of touchscreens, no revolutionary practical innovation has emerged for quite a while; only intelligent voice still enjoys the limelight of the industry's technology dividend period.

That leaves no small amount of room for gestures. Technical limitations are only one aspect; the fact is there remain many problems for HMI designers, product managers, and suppliers to think through and solve.


01

Recognition rate and stability

One of the biggest challenges facing artificial intelligence is getting a system with no human common sense or general knowledge to understand the real world. How does an algorithm distinguish genuine interaction intent from unexpected, inadvertent, spontaneous gestures?

In practice this shows up in two ways: the user thinks they made the right gesture, but the system fails to recognize it; or the user's inadvertent movement is captured and "accurately" executed by the system.

During our intelligent cockpit evaluations in 2021, one car's in-cabin camera repeatedly and mistakenly triggered its smoking detection from the captured images and forced the windows open for ventilation, when all I was doing was thinking through a problem with my hand habitually resting on my chin. Small accidents like this ran through the whole job; not exactly disruptive, but the impression they leave is still poor.

There are many possible reasons: environmental interference, a poorly set recognition threshold, being outside the recognition range, a non-standard movement, and so on. But blindly pushing up the recognition rate is by no means the right fix. It is like wake-word-free voice in the cockpit: a good feature, but applied globally and indiscriminately it leaves the system unable to tell whether the user is issuing a command, talking to themselves, or talking to someone else, which causes plenty of trouble.

If the system cannot effectively distinguish the intent behind these movements, its erratic behavior becomes even more troubling. When users have to spend time and attention smoothing over unexpected triggers, or fail to get the response they expect when they actually need it, the interaction has truly put the cart before the horse.
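One common-sense mitigation, sketched below purely as an illustration, is temporal confirmation: only fire an action when a gesture persists with high confidence across most of a short window of frames, and re-arm only after the hand relaxes. The window size and thresholds here are assumptions that would need tuning per sensor and per gesture.

```python
from collections import deque

class GestureDebouncer:
    """Fire a gesture only after it persists across most of the last `window` frames."""

    def __init__(self, window: int = 15, min_hits: int = 12, min_conf: float = 0.8):
        self.history = deque(maxlen=window)
        self.min_hits = min_hits
        self.min_conf = min_conf
        self.armed = True  # prevents re-firing while the hand stays in the pose

    def update(self, detected: bool, confidence: float) -> bool:
        self.history.append(detected and confidence >= self.min_conf)
        hits = sum(self.history)
        if self.armed and hits >= self.min_hits:
            self.armed = False
            return True   # trigger the action exactly once
        if hits == 0:
            self.armed = True  # hand relaxed; allow the next trigger
        return False
```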


02

Cultural similarities and differences

As the kindergarten story earlier showed, in just the span from the post-90s to the post-10s generation, the understanding of the "making a call" gesture has changed completely. Gestural meaning is shaped by generational culture.

Even the "OK" sign, by now practically a global gesture, is not immune: in the past two years a usage derived from social networks has infuriated young men in South Korea, so seriously that when a similar symbol appears in publicity material and is questioned by the public, the brand withdraws it and apologizes.


Let's take another simple example.

Take the "scissor hand" girls love to flash in photos: in English-speaking cultures it is the peace sign and stands for victory. Yet a V-sign with the back of the hand toward the other person is a serious insult in Britain; the story goes that English longbowmen used it to taunt and show off to the French during the Hundred Years' War. Later, the "victory and peace" gesture was inadvertently popularized by the famous wartime photographs of then British Prime Minister Winston Churchill.

Seemingly simple gestures carry countless possibilities, and the same gesture can take on completely different meanings in different countries.

So when designing gestures, HMI designers and product managers also need to give more thought to the cultural background and customs of each market.


03

Learning costs

How many sets of interactive gestures can you remember? For me, three or four in regular use is already the limit.

In recent years Huawei has boldly put several air gestures on its flagship phones: a five-finger grabbing motion toward the screen takes a screenshot, and flicking the hand up or down scrolls vertically.

Does it work? In most scenarios, yes, it is quite usable. The reality, though, is that in my daily use of the Mate 30 Pro the air screenshot was the only one I used with any frequency, and it only worked reliably in a fairly well-lit environment.


I actually think Huawei has been restrained here. As we all know, the postures of the human hand are endlessly variable: the fifty-three orchid-finger poses of the Mei school, created by the Peking Opera master Mei Lanfang, are dazzling enough just to browse in an illustrated book.


Of course, Mei's orchid fingering serves artistic expression, which is quite different from gesture interaction's emphasis on utility.

Since gesture interaction is built on intuition, its design also needs to match human intuition: easy to remember, easy to use, and easy to turn into habit.

So what do we need?

Every time a new technology product comes out, people always like to say that "the future has come".

Strip away the cloak of romanticism, and technology R&D, product planning, expectation management, feedback, and iteration add up to an extraordinarily long process. Getting there in a single step can only ever be a beautiful vision; otherwise the R&D process itself would be meaningless.

Gesture interaction is a good complement to touch and voice, and in certain scenarios, or by personal preference, it can even beat both. Of course, combining and coordinating different modes of perception and interaction fits the logic of product iteration better than pinning hopes on one hyper-developed sense alone, and the same is true of the evolution of species.

Take the false trigger mentioned earlier: if the in-cabin camera's judgment were combined with the AQS air-quality sensor's readings, the car would know I was just resting my chin on my hand in thought, not smoking, and not trying to look cool.

I believe that future is not far off.

A word from the expert


A solution is valuable when it effectively solves a difficult, valuable problem. The trouble with gesture interaction is that it does not crack the significant problems of cockpit interaction. The core bottleneck of the cockpit is the contradiction between surging task complexity and the small budget of visual and cognitive resources available: users need to drive easily and safely while also completing complex tasks such as setting a destination or browsing and picking songs.

Gesture interaction, especially air gestures, cannot resolve this bottleneck any better than touch, voice, or physical controls; it only adds trouble: gestures force users to relearn, imprecise sensing leads to misoperation, and keeping a hand suspended in the air causes fatigue.

A small number of on-screen touch gestures are worth considering, such as going back a level or returning to the home page. These few, well-crafted gestures must be easy to learn and intuitive, offered to skilled users only as optional advanced shortcuts; the average user still needs "visible" controls to avoid the learning threshold.
