
Meta's 'data2vec' is a step toward One Neural Network to Rule Them All

Jan. 20, 2022 | Hi-network.com

The race is on to create one neural network that can process multiple kinds of data -- a more-general artificial intelligence that doesn't discriminate about types of data but instead can crunch them all within the same basic structure.

The genre of multi-modality, as these neural networks are called, is seeing a flurry of activity in which different data, such as image, text, and speech audio, are passed through the same algorithm to produce a score on different tests such as image recognition, natural language understanding, or speech detection. 

And these ambidextrous networks are racking up scores on benchmark tests of AI. The latest achievement is what's called "data2vec," developed by researchers at the AI division of Meta (parent of Facebook, Instagram, and WhatsApp).

The point, as Meta researchers Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli reveal in a blog post, is to approach something more like the general learning ability that the human mind seems to encompass.

"While people appear to learn in a similar way regardless of how they get information -- whether they use sight or sound, for example -- there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities," the blog post states.

The main point is that "AI should be able to learn to do many different tasks, including those that are entirely unfamiliar."

Meta's CEO, Mark Zuckerberg, offered a quote about the work and its ties to a future Metaverse:

People experience the world through a combination of sight, sound, and words, and systems like this could one day understand the world the way we do. This will all eventually get built into AR glasses with an AI assistant so, for example, it could help you cook dinner, noticing if you miss an ingredient, prompting you to turn down the heat, or more complex tasks.

The name data2vec is a play on the name of a program for language "embedding" developed at Google in 2013 called "word2vec." That program predicted how words cluster together, and so word2vec is representative of a neural network designed for a specific type of data, in that case text. 
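As a rough illustration of what such an embedding does (a hypothetical toy example, not Google's actual word2vec code), each word is mapped to a vector, and words used in similar contexts end up with vectors that point in similar directions:

    # Toy illustration of word-embedding "clustering" (hypothetical vectors,
    # not the real word2vec model): related words get similar vectors.
    import numpy as np

    embeddings = {
        "king":  np.array([0.9, 0.1, 0.4]),
        "queen": np.array([0.8, 0.2, 0.5]),
        "apple": np.array([0.1, 0.9, 0.2]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
    print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words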


Also: Open the pod bay doors, please, HAL: Meta's AI simulates lip-reading


In the case of data2vec, however, Baevski and colleagues are taking a standard version of what's called a Transformer, developed by Ashish Vaswani and colleagues at Google in 2017, and extending it to be used for multiple data types. 

The Transformer neural network was originally developed for language tasks, but it has been widely adapted in the years since for many kinds of data. Baevski et al. show that the Transformer can be used to process multiple kinds of data without being altered, and the trained neural network that results can perform on multiple different tasks. 

In the formal paper, "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language," Baevski et al. train the Transformer for image data, speech audio waveforms, and text language representations. 

The very general Transformer becomes what is called a pre-trained model, which can then be applied to specific neural networks in order to perform specific tasks. For example, the authors use data2vec as pre-training to equip what's called "ViT," the "vision Transformer," a neural network specifically designed for vision tasks that was introduced last year by Alexey Dosovitskiy and colleagues at Google. 
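In practice, that pre-train-then-specialize pattern looks roughly like the following sketch (hypothetical PyTorch code with made-up module names, not Meta's released implementation): a general-purpose backbone is pre-trained, then a small task-specific head is attached and fine-tuned, for example for ImageNet classification.

    # Sketch of the pre-train / fine-tune split (hypothetical names, not Meta's code).
    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        # Stand-in for a ViT-style Transformer encoder.
        def __init__(self, dim=768):
            super().__init__()
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
                num_layers=12)

        def forward(self, x):          # x: (batch, tokens, dim)
            return self.encoder(x)

    backbone = Backbone()
    # ... pre-train `backbone` with the self-supervised data2vec objective ...

    # Fine-tuning: attach a classification head to the pre-trained backbone.
    head = nn.Linear(768, 1000)         # 1,000 ImageNet classes
    patches = torch.randn(2, 197, 768)  # dummy patch embeddings for 2 images
    logits = head(backbone(patches).mean(dim=1))
    print(logits.shape)                 # torch.Size([2, 1000])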

Meta shows top scores for the venerable ImageNet image-recognition competition. (Image: Meta, 2022)

When applied to ViT on the standard ImageNet test of image recognition, their results come in at the top of the pack, with an accuracy of 84.1%. That's better than the 83.2% score achieved last year by a Microsoft team, led by Hangbo Bao, that also pre-trained ViT.

And the same data2vec Transformer produces results that are state-of-the-art for speech recognition, and competitive, if not the best, for natural language understanding:

Experimental results show data2vec to be effective in all three modalities, setting a new state of the art for ViT-B and ViT-L on ImageNet-1K, improving over the best prior work in speech processing on speech recognition and performing on par to RoBERTa on the GLUE natural language understanding benchmark. 

The crux is that this happens without any modification of the neural network to make it specific to images, or to speech or text. Instead, every input type goes into the same network and completes the same very general task: the kind of task Transformer networks are commonly trained with, known as "masked prediction." 
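Masked prediction itself is simple to state: hide a random subset of the input positions and ask the network to predict something about what's hidden. A minimal sketch (illustrative only, not the paper's code):

    # Minimal masking sketch (illustrative only): hide a random subset of
    # input positions and replace them with a "mask" embedding.
    import torch

    def mask_inputs(x, mask_ratio=0.5):
        # x: (batch, seq_len, dim) -- patch, audio-frame, or token embeddings
        mask = torch.rand(x.shape[:2]) < mask_ratio   # True = hidden position
        x_masked = x.clone()
        x_masked[mask] = 0.0      # crude stand-in for a learned mask token
        return x_masked, mask

    x = torch.randn(2, 16, 8)
    x_masked, mask = mask_inputs(x)
    print(mask.float().mean())   # roughly the chosen mask ratio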


Also: Google's Supermodel: DeepMind Perceiver is a step on the road to an AI machine that could process anything


The way that data2vec performs masked prediction, however, follows an approach known as "self-supervised" learning. In a self-supervised setting, a neural network is trained by passing through multiple stages, without relying on labeled data. 

First, the network constructs a representation of the joint probability of data input, be it images or speech or text. Then, a second version of the network has some of those input data items "masked out," left unrevealed. It has to reconstruct the joint probability that the first version of the network had constructed, which forces it to create increasingly better representations of the data by essentially filling in the blanks. 

An overview of the data2vec approach. (Image: Meta, 2022)

The two networks, the one with the full pattern of the joint probability, and the one with the incomplete version that it is trying to complete, are called, sensibly enough, "Teacher" and "Student." The Student network tries to develop its sense of the data, if you will, by reconstructing what the Teacher has already achieved.
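A rough sketch of that Teacher/Student arrangement (illustrative PyTorch code, not Meta's implementation; in the actual paper the Teacher's weights track the Student's via an exponential moving average, and the loss is a smoothed L1 rather than the plain mean-squared error used here):

    # Teacher/Student sketch (illustrative, not Meta's implementation).
    # The Student sees the masked input; the Teacher sees the full input and
    # provides the regression targets. No labels are involved anywhere.
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    student = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2)
    teacher = copy.deepcopy(student)      # in the paper, the Teacher's weights are
    for p in teacher.parameters():        # an exponential moving average of the
        p.requires_grad = False           # Student's, not an independent model

    x = torch.randn(2, 16, 64)            # full (unmasked) input embeddings
    mask = torch.rand(2, 16) < 0.5        # positions the Student doesn't see
    x_masked = x.clone()
    x_masked[mask] = 0.0                  # crude stand-in for a mask token

    with torch.no_grad():
        targets = teacher(x)              # what the Teacher has already computed
    predictions = student(x_masked)

    # Regress the Teacher's representations at the masked positions only.
    loss = F.mse_loss(predictions[mask], targets[mask])
    loss.backward()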

You can see the code for the models on GitHub.

How does the neural network perform the Teacher and Student roles for three very different types of data? The key is that the "target" of joint probability in all three data cases is not a specific output data type, as it is in versions of the Transformer built for a single data type, such as Google's BERT or OpenAI's GPT-3. 

Rather, data2vec grabs a set of neural network layers that are inside the neural network, somewhere in the middle, which represent the data before it is produced as a final output. 

As the researchers write, "One of the main differences of our method [...] other than performing masked prediction, is the use of targets which are based on averaging multiple layers from the teacher network." Specifically, "we regress multiple neural network layer representations instead of just the top layer," so that "data2vec predicts the latent representations of the input data."

They add, "We generally use the output of the FFN [feed-forward network] prior to the last residual connection in each block as target," where a "block" is the Transformer equivalent of a neural network layer.
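A minimal sketch of how such a layer-averaged target could be computed (hypothetical code, not the released implementation): take the per-position outputs of the Teacher's top K blocks, normalize each one, and average them into a single target per position.

    # Sketch of the layer-averaged target (illustrative, not Meta's released code):
    # average the per-position outputs of the Teacher's top K blocks.
    import torch
    import torch.nn.functional as F

    def layer_averaged_target(block_outputs, top_k=8):
        # block_outputs: list of (batch, seq_len, dim) tensors, one per block,
        # e.g. the FFN output before the last residual connection in each block.
        stacked = torch.stack(block_outputs[-top_k:])     # (K, B, T, D)
        # Normalize each block's output so no single block dominates the average
        # (the paper normalizes targets before averaging them).
        stacked = F.layer_norm(stacked, stacked.shape[-1:])
        return stacked.mean(dim=0)                        # (B, T, D)

    # Example with dummy activations from a 12-block Teacher:
    dummy = [torch.randn(2, 16, 64) for _ in range(12)]
    print(layer_averaged_target(dummy).shape)   # torch.Size([2, 16, 64])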

The point is that every data type that goes in becomes the same challenge for the Student network of reconstructing something inside the neural network that the Teacher had composed.

This averaging is different from other recent approaches to building One Network To Crunch All Data. For example, last summer, Google's DeepMind unit offered up what it calls "Perceiver," its own multi-modal version of the Transformer. The training of the Perceiver neural network is the more-standard process of producing an output that is the answer to a labeled, supervised task such as ImageNet. In the self-supervised approach, data2vec isn't using those labels; it's just trying to reconstruct the network's internal representation of the data. 

Even more ambitious efforts lie in the wings. In October, Jeff Dean, head of Google's AI efforts, teased "Pathways," calling it a "next generation AI architecture" for multi-modal data processing.

Mind you, data2vec's very general approach of a single neural net for multiple modalities still relies on a lot of information about the different data types. Image, speech, and text are each prepared by modality-specific pre-processing of the data. In that way, the multi-modal aspect of the network still depends on clues about the data, what the team refers to as "small modality-specific input encoders."
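In code terms, each modality keeps its own small front-end ahead of the shared Transformer, roughly as in this sketch (hypothetical and simplified, not Meta's actual modules): a patch-embedding convolution for images, a convolutional feature extractor for raw audio, and a token-embedding table for text, all producing sequences of vectors of the same width.

    # Sketch of "small modality-specific input encoders" feeding one shared
    # Transformer (hypothetical and simplified, not Meta's actual modules).
    import torch
    import torch.nn as nn

    DIM = 64
    shared = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
        num_layers=2)

    image_encoder = nn.Conv2d(3, DIM, kernel_size=16, stride=16)   # patch embedding
    audio_encoder = nn.Conv1d(1, DIM, kernel_size=10, stride=5)    # waveform frames
    text_encoder  = nn.Embedding(30000, DIM)                       # token embedding

    image  = torch.randn(1, 3, 224, 224)       # pixels
    audio  = torch.randn(1, 1, 1600)            # 0.1 s of raw 16 kHz waveform
    tokens = torch.randint(0, 30000, (1, 32))   # text token ids

    img_seq = image_encoder(image).flatten(2).transpose(1, 2)   # (1, 196, DIM)
    aud_seq = audio_encoder(audio).transpose(1, 2)              # (1, 319, DIM)
    txt_seq = text_encoder(tokens)                              # (1, 32, DIM)

    # The same Transformer processes all three kinds of sequence.
    for seq in (img_seq, aud_seq, txt_seq):
        print(shared(seq).shape)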


Also: Google unveils 'Pathways', a next-gen AI that can be trained to multitask


We are not yet in a world where a neural net is trained with no sense whatsoever of the input data types. Nor are we at a point where the neural network can construct one representation that combines all the different data types, so that the neural net learns things in combination.

That fact is made clear from an exchange between ZDNet and the researchers. ZDNet reached out to Baevski and team and asked, "Are the latent representations that serve as targets a combined encoding of all three modalities at any given time step, or are they usually just one of the modalities?"

Baevski and team responded that it is the latter case, and their reply is interesting enough to quote at length:

The latent variables are not a combined encoding for the three modalities. We train separate models for each modality but the process through which the models learn is identical. This is the main innovation of our project since before there were large differences in how models are trained in different modalities. Neuroscientists also believe that humans learn in similar ways about sounds and the visual world. Our project shows that self-supervised learning can also work the same way for different modalities.

Given data2vec's modality-specific limitations, a neural network that might truly be One Network To Rule Them All remains the technology of the future.
