MOE - Mixture of Experts - Beginner friendly
OpenAI had a blast releasing ChatGPT, with more than a million people signing up to test it in the very first week. It quickly caught on with early adopters and has since become a popular tool for developers, writers, and others who want to reduce the time spent on mundane tasks that ML models can do quickly, or who want assistance when stuck on a problem. Recently there has been speculation that the model behind ChatGPT (GPT-3.5 or GPT-4) is not in fact a single monolithic model that is remarkably good at general AI, but is instead made of many models that work together (an MoE) to arrive at a good prediction based on the user's input. As this has not been officially confirmed by OpenAI, it remains speculation.
But that does beg the question: what is an MoE?
MoE, or Mixture of Experts, is a machine learning (ML) concept introduced in the early 1990s by Robert A. Jacobs, Michael I. Jordan, and their collaborators. It has since been adopted by many firms, with Google's Switch Transformer (built on top of T5, the Text-to-Text Transfer Transformer) being a major application of the idea at scale.
As the name suggests, an MoE has several experts working together to produce a solution to a given query. Each expert is a part of an ML model that is specifically designed to handle a subset of the data rather than the entire dataset. So, for example, when a user inputs a query, each expert makes its own prediction based on the context of the query. A gating network, which is itself another ML model, then decides how much weight to assign to each expert's prediction in order to form the final output.
Since the final output is a weighted mixture of many experts, this architecture is called the Mixture of Experts.
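To make the "weighted mixture" idea concrete, here is a minimal sketch of an MoE forward pass in plain Python. The expert and gate functions here are toy stand-ins, not any real framework's API: each expert produces a prediction, the gating scores are turned into weights with a softmax, and the output is the weighted sum.

```python
import math

def softmax(scores):
    """Turn raw gating scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_predict(x, experts, gate):
    """experts: list of functions x -> prediction.
    gate: function x -> one raw score per expert."""
    weights = softmax(gate(x))
    predictions = [expert(x) for expert in experts]
    # Final output = weighted mixture of the expert predictions.
    return sum(w * p for w, p in zip(weights, predictions))

# Two toy experts: one "specializes" in doubling, the other in shifting.
experts = [lambda x: 2 * x, lambda x: x + 10]

# A toy gate that favors the first expert for small x and the second for large x.
gate = lambda x: [-x, x]

print(moe_predict(0.0, experts, gate))  # weights are 0.5 each -> 0.5*0 + 0.5*10 = 5.0
```

In a real MoE, the experts and the gate are trained jointly, and the gate often activates only the top few experts per input to save compute, but the core computation is this weighted combination.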
The power of this approach is its ability to handle complex and varied data more effectively than a single, monolithic model can. This is particularly valuable when the input data comes from different domains or has different statistical properties, which makes it difficult for a monolithic model to accurately capture all the nuances of the data.
To make this concrete, let's consider a real-world example: an AI that is supposed to predict the competitive landscape of a firm. What do I mean by that? Crayon, for example, provides a market intelligence platform that uses AI to track, analyze, and act on everything happening outside a company's four walls. It tracks more than 100 different types of market data to provide a comprehensive view of a company's competitive landscape. Such a tool can help firms come up with better marketing strategies to tackle their competition.
While I do not know how Crayon is actually built or what architecture it uses, let's assume for the sake of illustration that it is based on an MoE architecture.
In Crayon's context, the market data it tracks spans social media activity, product updates, reviews, news articles, and other publicly available sources. Each of these data types can be handled by a different expert (model); the following are some examples.
- Social Media Expert: This expert model can be trained specifically on social media data. It will analyze posts, interactions, sentiment, and trends to gain insights into a competitor's social media strategy, audience engagement, trending topics, etc.
- Product Update Expert: This expert model will focus on analyzing updates, new features, and releases from competitor products. It will need to understand technical language and be capable of summarizing key points from product changelogs, update posts, and similar sources.
- Review Analysis Expert: This expert will specialize in understanding customer reviews across different platforms. It will extract information about customer satisfaction, product strengths and weaknesses, common complaints, and overall sentiment.
- News Analysis Expert: This expert will be trained to analyze news articles, press releases, and blogs about competitors, extracting key pieces of information, summarizing articles, and identifying significant events like partnerships, acquisitions, or leadership changes.
- SEO Expert: This expert will be trained to analyze competitor websites and SEO strategies to identify key SEO tactics, popular keywords, backlink strategies, etc.
- Pricing Analysis Expert: This expert will specialize in understanding competitor pricing strategies and can provide alerts on pricing changes or promotions.
The gating network in this case decides which expert or experts should handle each piece of incoming data, depending on the data type and context. It could be a rule-based system that simply directs data to the appropriate expert based on its source (e.g., if the data is a social media post, send it to the Social Media Expert), or it could be a more complex ML model that learns to route data based on both the source and the content.
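The rule-based version of such a gate is easy to picture in code. The sketch below is purely hypothetical: the source names, expert names, and routing rules are illustrative assumptions, not anything Crayon has published.

```python
def route(data_item):
    """Pick which expert should handle an incoming piece of market data,
    based only on its source (a simple rule-based gating network)."""
    source = data_item.get("source", "")
    # Hypothetical source -> expert routing table.
    rules = {
        "twitter": "social_media_expert",
        "changelog": "product_update_expert",
        "app_store_review": "review_analysis_expert",
        "press_release": "news_analysis_expert",
    }
    # Fall back to the news expert for unrecognized sources.
    return rules.get(source, "news_analysis_expert")

print(route({"source": "twitter", "text": "Competitor X just launched..."}))
# -> social_media_expert
```

A learned gating network would replace this lookup table with a classifier trained on both the source and the content of each item, producing a weight for every expert rather than a single hard choice.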
MoEs have become an underlying engine in many complex applications that require neural networks to understand data of a varied nature in order to predict something meaningful or creative. More on this blog in the coming months. Until then, stay tuned!