**SenseTime’s digital human research makes a splash at CVPR**
Fine-grained modeling of expressions and gestures, algorithms that make digital humans dance, 3D models generated from 2D pictures… New technologies for the metaverse and digital humans keep emerging in the AI field.
CVPR 2022, one of the most important academic conferences in artificial intelligence, has been underway over the past few days. This year’s conference received more than 10,000 submissions and accepted more than 2,000 papers, making it the largest in its history.
At the conference, SenseTime and its joint labs had 71 papers accepted, nearly a quarter of them as Oral (oral presentation) papers. Notably, much of this latest research covers cutting-edge fields such as 3D digital humans and 3D vision, pointing the way for technology applications in the metaverse.
In future AR and VR environments, high-quality immersive content will be needed to deliver efficient, low-cost experiences that extend space and time, which makes AI technology that automatically generates content almost the only feasible path. Within the AI research community, several recent developments stand out.
**Letting digital humans learn to dance on their own**
Most of the digital humans we have seen so far just stand to the side and talk, but since they are cast as “humans”, natural conversation alone is not enough. Can a virtual character’s movements break free of motion capture and be generated fully automatically by AI?
The difficulty in driving a 3D character to dance automatically to music is that the generated movements must not only be spatially well-formed and aesthetically pleasing, but also stay temporally consistent with different musical rhythms, making this an extremely challenging task.
In the paper “Bailando: 3D Dance Generation via Actor-Critic GPT with Choreographic Memory”, researchers from Nanyang Technological University, Sun Yat-sen University, UCLA and SenseTime propose a new music-to-dance framework, Bailando, whose “choreographic memory” and actor-critic GPT address the “space” and “time” challenges above and achieve high-quality AI choreography.
Most prior work tries to achieve choreography by designing an ingenious network that maps music directly into the high-dimensional continuous space of human poses. However, because this target space contains both standard dance poses and non-dance poses, such methods are usually unstable in practice and prone to collapsing into non-standard poses (such as freezing or strange jittering).
To restrict the movements to actual human dance, some studies collect real dance segments as dance units and choreograph by permuting and combining them. However, collecting dance units requires a great deal of labor, and the beats and tempos of units gathered this way are fixed, so they cannot be reused for music with different rhythms.
To address these problems, the dance generation framework Bailando is built around two main components: the “choreographic memory” and the actor-critic GPT.
The first is the “choreographic memory” module. To solve the spatial challenge, Bailando distills a subspace containing only standard dance poses through unsupervised learning on dance data, restricting the mapping’s target space to standard dance movements. Notably, instead of manually labeling dance units, the new method uses unsupervised learning to encode and quantize 3D joint sequences into a codebook, learning important and reusable dance elements.
To further expand the range of dances the memory can represent, the researchers split 3D poses into upper-body and lower-body parts for the AI to learn separately, so that a dance can be represented as a sequence of paired pose codes.
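The core of the choreographic memory is a learned codebook of reusable pose features. The sketch below illustrates the general VQ-VAE-style quantization idea in PyTorch; the names, sizes and module layout are illustrative assumptions, not Bailando’s released code.

```python
# Minimal sketch of a "choreographic memory": quantize encoded pose
# features into a finite learned codebook, VQ-VAE style.
import torch
import torch.nn as nn

class PoseCodebook(nn.Module):
    def __init__(self, num_codes=512, code_dim=256):
        super().__init__()
        # Each row is one learnable, reusable "dance element".
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                          # z: (batch, time, code_dim)
        # Nearest-neighbor lookup: replace each feature with its closest code.
        book = self.codes.weight[None].expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)  # discrete codes, (batch, time)
        z_q = self.codes(idx)                      # quantized features
        # Straight-through estimator so gradients still reach the encoder.
        return z + (z_q - z).detach(), idx

# Upper and lower body get separate memories, as described above.
upper_memory, lower_memory = PoseCodebook(), PoseCodebook()
feats = torch.randn(2, 40, 256)                    # dummy encoded upper-body features
quantized, code_ids = upper_memory(feats)
```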
Then, to compose these encoded dance moves into a dance, the authors introduce a GPT-like network, motion GPT, that translates the music into a sequence of dance codes. Because 3D poses are split into upper and lower bodies in the choreographic memory, motion GPT is further equipped with a cross-conditional causal attention layer to keep the two halves coordinated.
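To make the coordination mechanism concrete, here is a simplified sketch of what such a cross-conditional causal attention layer could look like: upper- and lower-body code embeddings are interleaved, and each position may attend to both streams up to its own time step. The interleaving scheme and shapes are assumptions for illustration, not the paper’s exact layer.

```python
# Simplified cross-conditional causal attention: each body half attends
# to the past of BOTH streams, so upper and lower body stay coordinated.
import torch
import torch.nn as nn

def cross_conditional_attn(upper, lower, attn: nn.MultiheadAttention):
    # upper, lower: (batch, time, dim). Interleave as [u_0, l_0, u_1, l_1, ...]
    b, t, d = upper.shape
    x = torch.stack([upper, lower], dim=2).reshape(b, 2 * t, d)
    # Causal mask at the time-step level: position i may attend to any
    # position j whose time step j // 2 is not later than its own i // 2.
    steps = torch.arange(2 * t) // 2
    mask = steps[None, :] > steps[:, None]     # True = blocked
    out, _ = attn(x, x, x, attn_mask=mask)
    return out[:, 0::2], out[:, 1::2]          # de-interleave the two streams

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
u, l = torch.randn(2, 40, 256), torch.randn(2, 40, 256)
u_out, l_out = cross_conditional_attn(u, l, attn)
```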
Bailando’s inference process: given a piece of music and a starting pair of pose codes, the actor-critic GPT autoregressively predicts the sequence of future pose codes; the choreographic memory then converts the code sequence into quantized features, and finally a dedicated CNN-based decoder decodes the 3D dance movements.
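In code, that inference loop might look roughly like the following; `motion_gpt`, `memory` and `decoder` are hypothetical stand-ins for the modules described above, not Bailando’s actual API.

```python
# Hypothetical sketch of the autoregressive inference loop described above.
import torch

@torch.no_grad()
def generate_dance(music_feats, start_codes, motion_gpt, memory, decoder, steps=240):
    # music_feats: (batch, time, dim); start_codes: (batch,) initial pose code.
    codes = [start_codes]
    for t in range(steps):
        # Predict a distribution over the next pose code from music + history.
        logits = motion_gpt(music_feats[:, : t + 1], torch.stack(codes, dim=1))
        codes.append(logits[:, -1].argmax(dim=-1))   # greedy pick; sampling also works
    code_seq = torch.stack(codes, dim=1)             # (batch, steps + 1)
    return decoder(memory(code_seq))                 # codes -> features -> 3D joints
```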
Making the avatar move is only half the battle; the movements also have to hit the beat. The researchers introduced an actor-critic reinforcement learning scheme into the GPT network, adding a newly designed beat-alignment reward function so that the generated dance stays synchronized with the music’s beat.
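As an illustration of the idea (not Bailando’s exact formula), a beat-alignment reward can score how often “kinematic beats”, local minima of joint speed, land near annotated music beats:

```python
# Assumption-laden sketch of a beat-alignment reward: +1 for each music
# beat matched by a nearby kinematic beat, -1 otherwise.
import numpy as np

def beat_alignment_reward(joint_positions, music_beats, fps=30, tol=0.1):
    # joint_positions: (time, joints, 3); music_beats: beat times in seconds.
    speed = np.linalg.norm(np.diff(joint_positions, axis=0), axis=-1).mean(axis=-1)
    # Kinematic beats: frames where average joint speed hits a local minimum.
    kin_beats = [t for t in range(1, len(speed) - 1)
                 if speed[t] < speed[t - 1] and speed[t] < speed[t + 1]]
    kin_times = np.array(kin_beats) / fps
    hits = sum(1 if np.any(np.abs(kin_times - b) < tol) else -1
               for b in music_beats)
    return hits / max(len(music_beats), 1)
```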
Extensive experiments on standard datasets show that the new framework achieves state-of-the-art (SOTA) results both qualitatively and quantitatively.
Bailando can drive digital humans to dance to background music and is expected to become an underlying technology for driving virtual anchors. In games and animation, the model’s ability to generate high-quality dance can also assist or replace human choreographers, greatly reducing costs.
**Recreating humans in the metaverse**
Beyond letting “NPCs” move their hands and feet, we also want the virtual world to portray our own likeness more accurately.
In the paper “Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer”, researchers from the Chinese University of Hong Kong, the University of Hong Kong, the University of Sydney and SenseTime present a visual understanding model specifically optimized for images of people.
Recently, the transformer architecture, originally developed for natural language processing, has shown great capability in computer vision tasks such as face alignment, pose estimation and 3D body mesh reconstruction.
Most transformer networks in computer vision divide the image directly into grid regions of equal size and shape, with each region represented by one token. This partitioning ignores the differences between the human body and the background, and among different body parts, which limits the network’s reconstruction accuracy on human details such as gestures and expressions.
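For contrast, this is roughly what that uniform grid tokenization looks like in a typical vision transformer; the patch size and dimensions are arbitrary choices here:

```python
# Standard grid tokenization: every token covers the same fixed patch,
# regardless of whether it lands on a face, a hand, or empty background.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)   # 16x16 patches
image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image).flatten(2).transpose(1, 2)       # (1, 196, 256)
```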
The new research proposes a transformer architecture, TCFormer, for human-centric visual understanding. It partitions tokens by feature clustering, dynamically adjusting each token’s size, shape and position according to the image’s semantic content and devoting finer tokens to important image details.
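TCFormer’s actual merging uses a density-peaks-style clustering of token features; as a rough stand-in, the sketch below merges tokens with plain k-means, so that token boundaries follow feature similarity rather than a fixed grid:

```python
# Simplified cluster-based token merging: group tokens by feature
# similarity and merge each cluster into one token.
import torch

def merge_tokens(tokens, num_clusters=49, iters=10):
    # tokens: (num_tokens, dim) -> merged: (num_clusters, dim)
    centers = tokens[torch.randperm(tokens.size(0))[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # nearest center
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(dim=0)              # merged token
    return centers, assign

tokens = torch.randn(196, 256)            # e.g. the grid tokens from above
merged, assignment = merge_tokens(tokens)
```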
TCFormer achieves state-of-the-art results on image-based human keypoint estimation, facial keypoint estimation and 3D human mesh reconstruction, with significant gains in the reconstruction accuracy of human body details.
Specifically, to address the loss of detail, the researchers propose a multi-stage token aggregation (MTA) head that efficiently preserves image detail across all stages. Starting from the tokens of the final stage, the MTA head progressively upsamples the tokens and aggregates features from the earlier stages until features from all stages have been merged. The aggregated tokens correspond one-to-one with the pixels of the feature map and are reshaped into a feature map for subsequent processing.
TCFormer adopts a multi-stage architecture consisting of four hierarchical stages and the multi-stage token aggregation (MTA) head. Each stage consists of several stacked transformer blocks; between adjacent stages, a clustering-based token merging (CTM) block merges tokens and generates the tokens for the next stage. The MTA head aggregates the token features from all stages and outputs the final heatmap.
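A rough sketch of the multi-stage aggregation idea, written over feature maps rather than tokens for simplicity; the stage shapes and the fusion operator are assumptions for illustration:

```python
# Upsample coarse-stage features step by step, fusing each finer stage's
# features on the way up, until the finest resolution carries all stages.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTAHead(nn.Module):
    def __init__(self, dims=(512, 256, 128, 64), out_dim=64):
        super().__init__()
        # 1x1 convs project every stage to a common channel width.
        self.proj = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in dims)

    def forward(self, stage_feats):
        # stage_feats: coarsest-to-finest feature maps, e.g. 7x7 ... 56x56.
        x = self.proj[0](stage_feats[0])
        for proj, feat in zip(self.proj[1:], stage_feats[1:]):
            x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = x + proj(feat)                   # fuse the finer stage
        return x                                 # finest-resolution feature map

feats = [torch.randn(1, d, s, s) for d, s in zip((512, 256, 128, 64), (7, 14, 28, 56))]
out = MTAHead()(feats)                           # (1, 64, 56, 56)
```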
According to SenseTime researchers, TCFormer focuses mainly on human-related tasks and can serve applications involving human pose estimation, such as SenseMARS Avatar and SenseMARS Agent. With TCFormer, details are captured better, enabling more refined pose estimation in these applications and thus more detailed and complex effects.
In the paper, the researchers report significant improvements on a whole-body dataset, a task that requires the algorithm to simultaneously estimate keypoints on the body, hands and face. TCFormer’s whole-body pose estimation accuracy (57.2% AP and 67.8% AR) exceeds the industry’s best methods, and it performs especially well on hand keypoint detection, demonstrating its ability to capture small-scale image details.
TCFormer can capture human movements, expressions and gestures at the same time, making avatars in virtual reality and metaverse applications more vivid and flexible; people can control virtual characters more finely and thereby achieve a deeper sense of immersion.
For example, in games such as VRChat, if the user’s movements, expressions and gestures could be vividly reconstructed from camera images, the game experience would improve greatly.
VRChat as it stands today: the characters’ movements are still stiff.
TCFormer can also help virtual idols perform; the more lifelike the characters, the better the performance.
In motion-sensing games, finer image understanding likewise makes the user’s actions more precise and deepens immersion. In the future, thanks to such algorithms, we may no longer need complex motion capture equipment; a single camera could be enough to play metaverse games.
**Developing AI technology to lead the digital human industry**
CVPR, a top artificial intelligence conference, currently ranks fourth in Google Scholar’s rankings of academic journals and conferences, behind only Nature, the New England Journal of Medicine and Science, and ahead of Cell and JAMA. Every year, the research at CVPR heralds the direction of computer vision technology.
The metaverse has been an important topic in the technology world recently. Notably, as early as August 2020, SenseTime introduced its own mixed reality innovation platform, SenseMARS.
This is a “creator” platform for building the metaverse, including SenseMARS Avatar for creating virtual metaverse avatars, SenseMARS Agent for developing metaverse “natives” such as digital humans, and SenseMARS Reconstruction for digitally reconstructing the physical world.
The SenseMARS platform now integrates more than 3,500 artificial intelligence models and supports perceptual intelligence and mixed and augmented reality systems (MARS), creating new metaverse experiences. With SenseMARS, virtual characters in the metaverse can exhibit intelligent behaviors and movements, allowing people to interact with AI naturally.
A digital human created with SenseMARS can not only “understand” human speech but also communicate with us through language, facial expressions and body movements. Meanwhile, by training on knowledge data from different domains, digital humans can become our intelligent assistants in a variety of fields.
SenseMARS Reconstruction, through multi-algorithm fusion, lets consumer electronic devices (such as mobile phones, action cameras and drones) efficiently reconstruct high-precision 3D models of the physical world, from small objects to shopping malls, transportation hubs and even entire cities.
SenseTime’s digital humans have already entered everyday life. In February this year, the Shanghai branch of Bank of Ningbo hired “Xiao Ning”, digital employee No. 001, to provide business consulting and transaction services for the bank’s customers. Behind it is SenseTime’s full-chain service support for banks, built on its “AI Digital Human Service Middle Office”.
According to reports, the digital human Xiao Ning can answer more than 550 common business questions and more than 3,000 derived business questions; through continuous optimization via the operations management platform, more than 50 new business-related derived questions can be added every day.
SenseTime has long been known for its leading technology. Since its founding in 2014, the company has encouraged its research teams to pair research with industrial deployment, building technical moats in fields such as smart cities, autonomous driving and smart cultural tourism, advancing these industries and achieving remarkable results.
This exploration is now extending into the metaverse. In its IPO prospectus at the end of last year, SenseTime stated clearly that it would focus investment on its metaverse platform: the company plans to use 60% of the funds to enhance R&D capabilities, of which metaverse-related investment accounts for 40%, and 20% will go to enhancing other AI R&D capabilities, including SenseMARS and SenseAuto.
When its artificial intelligence infrastructure, the AI “large device”, was put into use, SenseTime co-founder and CEO Xu Li said that massive data should be broken apart and made to collide on this AI infrastructure to mine its latent value deeply, thereby breaking the boundary between cognition and application. Breaking that boundary means connecting the virtual and real worlds, and SenseTime is using its own AI technology to drive the comprehensive digital transformation of the physical world.
The wave of building virtual worlds will bring new opportunities, and AI technology will play a crucial role in it.
https://www.sohu.com/a/559922819_129720