Compiled by | Ling Min, Nuclear Coke
After a long wait, Musk has finally delivered on his open source promise.
Musk open-sourced the Twitter recommendation algorithm
On March 31, as Musk had repeatedly promised, Twitter officially open-sourced part of its source code, including the algorithm that recommends tweets in user timelines. So far, the project has gained more than 10,000 stars on GitHub.
GitHub address: https://github.com/twitter/the-algorithm
Musk said on Twitter that the release covers "most of the recommendation algorithms" and that the rest will be opened up over time. He also said he wants "independent third parties to determine with reasonable accuracy what Twitter may show to users." In a Twitter Spaces discussion about the release, he said the open source initiative aims to make Twitter "the most transparent system on the Internet" and as robust as Linux, the best-known and most successful open source project. "The overall goal is to make it as enjoyable as possible for those who continue to support Twitter."
The Twitter blog details what the algorithm considers, ranks, and filters when deciding which tweets appear on the For You timeline.
The main components used to build the timeline
According to the blog post, the recommendation pipeline consists of three main stages.
First, it fetches the best tweets from different recommendation sources. It then ranks each tweet using a machine learning model. Finally, it filters out tweets from blocked users, tweets the user has already seen, and not-safe-for-work (NSFW) content, and displays the result on the timeline.
The blog post then explains each stage in more detail.
For example, the first stage looks at about 1,500 tweets, with the goal that about 50% of the tweets in the For You timeline come from accounts the user follows ("in-network") and about 50% come from "out-of-network" accounts the user does not yet follow. Ranking is optimized for positive engagement (such as likes, retweets, and replies), and the final stage ensures that users do not see too many tweets from the same person.
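The three stages described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Twitter's code: the `Tweet` fields, the `build_timeline` function, and the `max_per_author` parameter are hypothetical names, and the `score` field stands in for the output of the ranking model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    in_network: bool    # True if the viewer follows the author
    score: float        # hypothetical ranking-model output (higher is better)
    nsfw: bool = False


def build_timeline(candidates, blocked, seen, size=10, max_per_author=2):
    """Rank candidates, then filter and mix them into at most `size` tweets."""
    # Stage 2: rank every candidate by its model score, best first.
    ranked = sorted(candidates, key=lambda t: t.score, reverse=True)

    # Target split: about half in-network, half out-of-network.
    quota = {True: size // 2, False: size - size // 2}
    used = {True: 0, False: 0}
    per_author = {}
    timeline = []

    # Stage 3: walk the ranking and apply filters and heuristics.
    for tweet in ranked:
        if len(timeline) == size:
            break
        if tweet.author in blocked or tweet.tweet_id in seen or tweet.nsfw:
            continue  # blocked author, already seen, or NSFW
        if used[tweet.in_network] >= quota[tweet.in_network]:
            continue  # this half of the timeline is already full
        if per_author.get(tweet.author, 0) >= max_per_author:
            continue  # avoid too many tweets from the same person
        timeline.append(tweet)
        used[tweet.in_network] += 1
        per_author[tweet.author] = per_author.get(tweet.author, 0) + 1
    return timeline
```

Stage 1, candidate sourcing, is represented here simply by the `candidates` argument; in the real system it is the step that narrows a huge pool down to about 1,500 tweets.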
Admittedly, code transparency (users can see exactly how the system chooses a tweet for the timeline) and open source (allowing the community to submit their own code as an alternative, and also use the Twitter algorithm in other projects) are not exactly the same thing.
Although Musk has repeatedly talked about open source, if Twitter wants to be true to its word, it must meet the latter standard. In other words, Twitter needs to establish a new governance system that decides which PRs to approve, which issues to focus on, and how to stop malicious actors from tampering with the code for their own ends.
For now, Twitter says it is working on this. The README on GitHub invites the community to submit issues and PRs suggesting improvements to the recommendation algorithm, but it also says that Twitter is still building tools to manage these suggestions and sync changes to its internal repository. Twitter under Musk has made many promises and has not always kept them, so it will be hard to judge this one until Twitter actually accepts code from the community.
Musk's commitment to open source
Musk had said many times that he would open source the Twitter algorithm.
In March 2022, Musk ran a poll on Twitter asking users whether the platform's algorithm should be open source. He wrote: "I worry that the bias that actually exists in Twitter's algorithm will have a big impact. How do we know what's going on behind the scenes?" Musk believes that the more trust we can place in Twitter as a public platform, the lower the risk to civilization.
In May 2022, Musk had a dispute with Twitter co-founder and former CEO Jack Dorsey over the platform's algorithm. Musk said, "Algorithms are manipulating you in ways you don't realize ... I'm not saying the algorithm is malicious, but it does guess what you want to see, and then inadvertently manipulates/amplifies your point of view without you being completely aware of what's going on."
After Musk took over Twitter in October 2022, his position on open-sourcing the algorithm did not change.
On February 21, 2023, Musk said that he would open source the Twitter algorithm the following week. At the time, a Twitter user said they would be "genuinely convinced" if Twitter could open source the algorithm. Musk responded: "When we open source the algorithm next week, be prepared to be disappointed at first, but it will improve quickly after that."
Musk did not keep his "open source next week" promise. It was not until March 18 that he spoke up again: "Twitter will open source all the code used for tweet recommendations on March 31."
Musk said: "Our 'algorithm' is too complex and not fully understood internally. People will find a lot of stupid things, but we will patch the problem as soon as we find it. We are developing a simplified way to deliver more engaging tweets, but it is still a work in progress, and it will also be open source. Providing code transparency can be awkward at first, but it should lead to a quick increase in the quality of recommendations. Most importantly, we want to earn your trust."
Embarrassingly, however, the Associated Press reported on March 26 (local time) that, according to a legal filing, part of Twitter's source code had been leaked and posted on GitHub, the open source code hosting site. To prevent the leak from damaging its services, Twitter took legal action, and GitHub complied with the notice and removed the leaked code.
Now Musk has finally got his wish to open source the Twitter algorithm, but the decision also faces strong criticism. Users have complained that Musk's tweets show up too often on their For You page, while Musk's supporters worry that their engagement is declining. Musk has argued that the new recommendation algorithm is meant to reduce the visibility of negative and hateful content as much as possible, but outside analysts who had previously lost access to the code were not convinced.
In addition, Twitter may face competitive pressure from the open source community. Mastodon is a decentralized social network that is currently gaining popularity in certain circles. Twitter co-founder Jack Dorsey is backing a similar open source project called Bluesky.
The underlying working mechanism of Twitter's recommendation algorithm
For a system as complex as Twitter, open-sourcing the algorithm is no easy task. In an article, open source author Travis Fischer explained that Twitter's recommendation algorithm is powered by a personalized recommendation system that predicts which tweets and users you are most likely to interact with. The two most important parts of this recommendation system are:
The underlying data used to train the ML models, that is, Twitter's large-scale private network graph;
The ranking signals considered when determining relevance.
Large-scale private network graph
A social network like Twitter is an example of a giant graph, where nodes model users and tweets, and edges model interactions such as replies, retweets, and likes.
Visualization of the Twitter Dynamic Network Graph by Michael Bronstein, from Twitter's Graph ML Division (2020).
A large part of Twitter's core business value comes from this vast underlying data set of users, tweets, and interactions. Users log in, view Tweets, click on Tweets, view user profiles, post Tweets, reply to Tweets, etc., and every interaction on Twitter is recorded in an internal database.
The data obtained from Twitter's public APIs is only a small part of Twitter's internal tracking data. This is important because Twitter's internal recommendation algorithm has access to all this rich interaction data, whereas any open source effort may only be able to use a limited data set.
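The graph described in this section, with users and tweets as nodes and interactions as edges, can be sketched as a small in-memory structure. This is a toy illustration only: the class `InteractionGraph` and its methods are hypothetical names, and a production graph with hundreds of millions of nodes would need distributed storage rather than Python dictionaries.

```python
from collections import Counter, defaultdict


class InteractionGraph:
    """Toy graph: user and tweet nodes, typed interaction edges."""

    def __init__(self):
        self.tweet_author = {}            # tweet node -> authoring user node
        self.edges = defaultdict(list)    # user node -> [(kind, tweet_id), ...]

    def add_tweet(self, tweet_id, author):
        self.tweet_author[tweet_id] = author

    def add_interaction(self, user, kind, tweet_id):
        """Record an edge such as 'like', 'reply', or 'retweet'."""
        if tweet_id not in self.tweet_author:
            raise KeyError(f"unknown tweet {tweet_id}")
        self.edges[user].append((kind, tweet_id))

    def engagement_counts(self, tweet_id):
        """Count incoming edges of each kind for one tweet."""
        return Counter(
            kind
            for interactions in self.edges.values()
            for kind, tid in interactions
            if tid == tweet_id
        )

    def author_affinity(self, user):
        """Count how often `user` interacted with each author's tweets."""
        return Counter(
            self.tweet_author[tid] for _, tid in self.edges.get(user, [])
        )
```

Queries like `author_affinity` are the kind of signal that only the full internal interaction log can answer well, which is why a public API sample gives outside projects a much weaker starting point.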
Ranking information
In a 2017 blog post titled "Using Deep Learning at Scale in Twitter's Timelines," Twitter researchers said that to predict whether a tweet will engage a user, Twitter's model considers the following:
The tweet itself: its recency, whether a media card (image or video) is present, and its total number of interactions (such as retweets and likes).
The tweet's author: the user's past interactions with this author, the strength of the user's connection to them, and the origin of their relationship.
The user: tweets the user has found engaging in the past, and how often and how heavily they use Twitter.
According to the researchers, "The list of features we consider and their various interactions is growing, providing our model with more nuanced patterns of behavior."
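To make the three feature groups above concrete, here is a minimal sketch that combines them into a single engagement probability with a logistic function. It is not Twitter's model, which is a deep neural network: the `Features` fields, the `engagement_probability` function, and every weight below are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Features:
    # The tweet itself
    age_hours: float
    has_media: bool
    total_interactions: int
    # The tweet's author
    past_interactions_with_author: int
    # The user
    sessions_per_day: float


# Hypothetical weights; a real system would learn these from interaction data.
BIAS = -3.0
W_RECENCY = 1.5
W_MEDIA = 0.4
W_POPULARITY = 0.3
W_AFFINITY = 0.8
W_ACTIVITY = 0.2


def engagement_probability(f: Features) -> float:
    """Return a value in (0, 1): the sketch's guess that the user engages."""
    recency = math.exp(-f.age_hours / 24.0)  # decays from 1 toward 0 as the tweet ages
    z = (
        BIAS
        + W_RECENCY * recency
        + W_MEDIA * float(f.has_media)
        + W_POPULARITY * math.log1p(f.total_interactions)
        + W_AFFINITY * math.log1p(f.past_interactions_with_author)
        + W_ACTIVITY * math.log1p(f.sessions_per_day)
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing


fresh = Features(1.0, True, 120, 5, 3.0)
stale = Features(72.0, True, 120, 5, 3.0)
# With these positive weights, the fresher tweet always scores higher.
print(engagement_probability(fresh) > engagement_probability(stale))
```

The log transforms keep a viral tweet with millions of interactions from drowning out the other signals, a common choice in hand-built scorers and an assumption here rather than something the source describes.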
This 2017 description of ranking may be somewhat outdated, but its core signals are still highly relevant to Twitter today, because the list has likely been generalized to the dozens or even hundreds of machine learning models that underpin Twitter's algorithm.
A visualization of a deep learning model that determines the likelihood that one user will follow another user in the future. This model represents a small subset of the various recommendation systems within Twitter.
Travis Fischer believes that open-sourcing Twitter's recommendation algorithm will inevitably encounter some significant engineering challenges.
Twitter's network graph, for example, is huge, with hundreds of millions of nodes and billions of edges. Twitter's real-time nature presents another unique challenge: users want Twitter to be as close to real time as possible, which means the underlying network graph is highly dynamic and latency becomes a real user experience problem. There are also reliability, security, and privacy challenges.
Regardless, Musk has delivered on his open source promise, and open-sourcing Twitter's recommendation algorithm marks a key step toward transparency for platforms of this kind.
Reference Links:
https://www.theverge.com/2023/3/31/23664849/twitter-releases-algorithm-musk-open-source
https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
https://www.infoq.cn/article/Es2BoMREB9JofbzQ2SBU