The Data Access Problem: Limitations on Access to Public Data on Very Large Online Platforms

Richard Kuchta, Beatriz Almeida Saab, Lena-Maria Böswald

23.03.2023

Key Findings

Despite the upcoming obligations of the Digital Services Act (DSA) and the Code of Practice on Disinformation committing platforms to implement measures for increased transparency and to support the research community by providing access to data, some recent developments undermine that commitment. The move by Meta to stop its maintenance of CrowdTangle, a tool for data collection and analysis of Facebook and Instagram data, along with Twitter’s abrupt decision to impose a paywall for its developers’ application programming interface (API), which has been used by non-academic researchers, are examples of the contradiction between commitments and actions.

Although Meta’s platforms and TikTok allow access to data such as posts on pages, public groups or public accounts, they do not provide access to certain other types of data, despite their public nature. For example, comments on public Facebook or Instagram posts are not accessible through the API, nor are the posts of accounts that have set their visibility to “public”. This creates a blind spot for researchers, especially when public and political figures use this feature for their public communication, or when analysing the spread of hate speech and disinformation in comments. There are also gaps in the provision of data for additional features on platforms, such as “stories” on Facebook and Instagram, “reels” on Instagram, or “shorts” on YouTube. Although YouTube has announced that it will include shorts in its API access, at the time of writing, this data remained unavailable.

Platforms like TikTok and Telegram provide access through APIs for developers. Although this API access is also available to researchers, such an API is not designed to serve research purposes, and creates additional obstacles, such as the risk of violating terms of service or restricted access to certain types of data. TikTok's newly announced researcher API is a first step in the right direction but, under its current vetting process, only United States-based academic institutions qualify, and non-academic institutions and European academic institutions are omitted. While the developer API for Telegram offers a wide range of data, TikTok's developer API is of limited use to researchers.

Our Recommendations

Based on these findings, we recommend the following changes in the practices of platforms to empower researchers by giving them greater access to data:

Platforms should provide access to all public data, including comments in public fora or on public posts. This is particularly important for comments and public posts on Facebook, Instagram and TikTok;
Platforms should regularly update their APIs to provide access to data on new features, such as stories or shorts. These features are used heavily by the users and represent a blind spot for researchers;
Platforms should develop and provide access to APIs for both academic and non-academic researchers. Restrictions on access to APIs, such as TikTok’s, which is only for developers, or Twitter's, which grants access only to academic institutions, limit research in the public interest. Without API access, researchers are limited in their ability to understand how disinformation or hate speech is circulated. Platforms should also clearly and transparently communicate changes to their APIs; and
The Terms of Service on APIs should better reflect the DSA, allowing for the analysis of systemic risks without limitations such as TikTok's conditions for data refresh and deletion processes. Such conditions preclude the monitoring of content moderation decisions.

Introduction

Social media have become an essential part of political discourse. Research on social media discourse has been crucial to identifying problems and threats for democracy. However, researchers face significant limitations in this work, due to problems with accessing data. The platforms themselves are the gatekeepers of data, providing access to different types of data, levels of privacy, or access points, often creating obstacles for research. Under the Digital Services Act (DSA), Very Large Online Platforms (VLOPs) will be obliged to provide research organisations with the data needed to assess their compliance with the DSA's provisions. As this will be an important change for research on the societal harm that can result from content and practices on social media platforms, we provide an overview here of data access for various platforms, analysing their approaches to fulfilling these obligations.

This report evaluates the current state of data access on VLOPs, comparing the nature and accessibility of their data. This is the first of a series of reports that will expand on the topic and cover more platforms and data categories. The aim of this series is to provide policymakers, researchers, and civil society with a better understanding of which data they can access, from which platforms, and how. We will monitor the development of data access over time, particularly to understand how it evolves under the implementation of the DSA.

Background

So far, the decision on who can access data is in the hands of platforms. Most of the major social media platforms have created access for developers or researchers through the use of APIs (application programming interfaces). While API access requires programming skills, it is more adaptable and allows researchers to gather data more precisely. Researchers without computer programming skills usually rely on commercial social media listening tools provided by platforms or third-party providers. Access to platform-provided social media listening tools is conditioned on a vetting process, providing platforms with control over who can access their data. Tools from third-party providers are not only costly, but they may also not necessarily be adapted to the particular needs of researchers.

The DSA provision most relevant to data access is Article 40, which sets out the rules for it. Article 40 is not yet in force, as we are still in the preparation phase, and the DSA will be applied in stages before entering into full force in February 2024. The DSA will change the current data access regime. Firstly, platforms will no longer be in control of which organisations are eligible to access their data, as these powers will be moved to the Digital Services Coordinator (DSC) in the member state where the relevant VLOP is located – in most cases, Ireland. The DSCs will decide on vetted researchers, including non-academic research organisations, and ask the relevant platform to provide data relevant to monitoring compliance with the DSA. The DSA strengthens platform data obligations even further, as the DSCs can ask a platform to explain the design of its recommender system.

Besides the DSA, there is also the Code of Practice on Disinformation, which is a co-regulatory instrument. Signatories of the Code of Practice on Disinformation have committed themselves to support the research community and provide access to data for the study of the spread of disinformation. Codes of Conduct are linked to the DSA, as they can be seen as a mitigation strategy that VLOPs need to update regularly to mitigate systemic risks. The Code of Practice on Disinformation may become a Code of Conduct in the future. Before the DSA enters fully into force, the Code of Practice is the only document forcing signatories to ensure higher data transparency. At the moment, researchers can only access data that a given platform decides to share. Although the DSA is in the implementation phase, Twitter's recent decision to impose a paywall for data access and the buggy nature of Meta's CrowdTangle tool, combined with a lack of investment in maintenance by Meta, threaten the above-mentioned obligations.

Data Access Overview by Platform

Facebook and Instagram

Meta’s platforms share data through a company-owned social media listening tool, CrowdTangle, which provides access to public pages and groups (Facebook) and public accounts (Instagram). Access is granted to vetted academic researchers or civil society organisations. There is no limit on historical data access, but researchers cannot access unverified public accounts with fewer than 50,000 followers, and have no access to comments under posts on public pages and public groups. Moreover, there is no access to new platform features, such as stories.

YouTube

YouTube provides access for researchers and developers through its API. This allows access to all public channels and provides unlimited access to historical data. YouTube recently introduced a new feature to share content, named “shorts”, and access to data for this is not yet available through the API.

TikTok

TikTok currently offers API access only to developers. Such data can be used by researchers, despite the limitations resulting from the fact it was not developed for research purposes. The developer API includes unlimited access to the historical data of public accounts. TikTok recently announced the development of a researcher API for United States-based academic institutions, including data on public profiles and keyword search results.

Twitter

Twitter had very well-developed access to data both for developers and vetted researchers and, hence, was also accessible through several third-party social media listening tools. After its acquisition by Elon Musk, certain changes were introduced, such as a paywall for the developer API. The monetisation of access to researcher data was also proposed, but the platform decided to postpone this step. Although the researcher API is still accessible to researchers without a paywall, non-transparent and confusing communication creates obstacles. Moreover, access to the academic API is granted only to academic researchers, which excludes non-academic research organisations; as a result, they have to use the developer API.

Telegram

Telegram, a messenger platform, provides API access only for developers, but this can be used by researchers and offers wide access to data. Features such as public groups or channels allowing one-way messaging to large audiences resemble social media platforms, making Telegram a hybrid of a social media platform and a messenger.

Platform	Which Instance?	API	Who has access?	Historical Data?	Public Spaces not Available
Facebook	Public Pages and Groups	Yes	Vetted Researchers + Developers	Unlimited	Public Accounts, Comments on Public Pages and Groups, Additional Features
Instagram	Public Accounts	Yes	Vetted Researchers + Developers	Unlimited	Comments on Public Accounts, Additional Features
YouTube	Public Channels	Yes	Researchers + Developers	Unlimited	Additional Features on Public Channels
TikTok	Public Accounts	Yes	Vetted Developers*	Unlimited	Comments on Public Accounts, Additional Features
Twitter	Public Accounts	Yes	Vetted Academic Researchers + Paid Access	Unlimited	Audio
Telegram	Public Channels and Groups	Yes	Developers	Unlimited	All public spaces available

Although access to the instances cited in the table above seems comparable across all VLOPs, the design of and decisions made by platforms directly affect accessibility. On Twitter, researchers can also access comments or retweets through the API, unlike on Meta’s platforms, where this data is not accessible. Even when the post in question is set to public, and thus visible to all other platform users, the underlying comments cannot be retrieved through the API. This points to the platform's decision not to make such data accessible, and can be particularly problematic for research on hate speech, coordinated activity and inauthentic use of services, which occur in comments as well.

Limited Access to APIs and Data Collection Tools

Access to data is also limited when platforms introduce new features. These include “stories” or live-stream videos on Facebook, “stories” and “reels” on Instagram, and “shorts” on YouTube. While YouTube has announced that it is expanding its API to cover new features, Meta has not publicly announced any plan to provide access to stories, despite their widespread use on Instagram. As social media users tend to behave according to the feature they use, having access to data from these features is relevant for comprehensive social media research. For example, a recent study in the context of Tunisia identified the use of coordinated political live broadcasts on Facebook, through the “live” feature, as a means of spreading disinformation.

In terms of who can access data, every platform approaches the issue differently. Meta vets organisations before granting access to CrowdTangle, and registration is necessary to receive this access. TikTok vets requests even for access to its developer API, which makes its data the most difficult to obtain among the platforms. Twitter's approach to providing its researcher API is stricter than that of Meta, as it gives data access only to academic researchers. With its developer API already behind a paywall, access to data for non-academic researchers has become extremely expensive, creating a major obstacle for their research activities. Twitter’s communication about its researcher API has been chaotic, making future research on the platform uncertain, especially given its recent announcement of unreasonably high API pricing. Although Telegram has an API only for developers, it is widely used by researchers, due to the wide range of data that can be retrieved from public channels and groups.

Type of Data Accessible

The table below provides an overview of which data metrics are available for each of the platforms studied in this paper. This cross-platform comparison helps in understanding trends in data access, the impact of data types on research outputs, and the improvements in access to data that policymakers must advocate for to ensure accurate insights into adherence to the new legislation.

Category of Data/Platform
Name (username/group/channel name)
Comments/replies
Geolocation
Time of post
Reactions
URL of post
Followers/Group Members at the time of posting
Forwarded from/RT		Not applicable	Not applicable
Views on the post	If it's a video			Not applicable
Description of Post/Caption/Text

The table shows that the data that different social media platforms provide through their APIs is generally similar. Regardless of the platform, researchers typically have access to information such as the username of the account posting or the channel/group name, the time at which the post was made, reactions to the post, the number of followers the account has, the number of subscribers a channel has, the number of views the post has received, and a description of the content being posted. These similarities make it easier for researchers to develop analyses across multiple platforms, providing for more consistent study.
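To illustrate why these shared fields ease cross-platform analysis, the following is a minimal sketch of a platform-neutral record. It is not taken from any platform's API: the platform labels (`platform_a`, `platform_b`) and all raw field names in `FIELD_MAPS` are hypothetical placeholders, and a real export would need its own mapping.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PublicPost:
    """Platform-neutral record holding the fields most platforms expose."""
    platform: str
    account_name: str                           # username, group or channel name
    posted_at: datetime                         # time of post, normalised to UTC
    text: str                                   # description, caption or message text
    url: str
    reactions: int
    followers_at_posting: Optional[int] = None  # not reported by every platform
    views: Optional[int] = None                 # may only exist for videos


# Hypothetical raw field names; real exports name these differently.
FIELD_MAPS = {
    "platform_a": {"account_name": "page", "posted_at": "created",
                   "text": "message", "url": "link", "reactions": "likes",
                   "followers_at_posting": "followers", "views": "video_views"},
    "platform_b": {"account_name": "channel", "posted_at": "date",
                   "text": "caption", "url": "permalink",
                   "reactions": "reaction_count",
                   "followers_at_posting": "subscribers", "views": "views"},
}


def normalise(platform: str, raw: dict) -> PublicPost:
    """Map one raw export row onto the shared schema."""
    fields = FIELD_MAPS[platform]
    posted = datetime.fromisoformat(raw[fields["posted_at"]])
    if posted.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        posted = posted.replace(tzinfo=timezone.utc)
    return PublicPost(
        platform=platform,
        account_name=raw[fields["account_name"]],
        posted_at=posted.astimezone(timezone.utc),
        text=raw.get(fields["text"], ""),
        url=raw[fields["url"]],
        reactions=int(raw.get(fields["reactions"], 0)),
        followers_at_posting=raw.get(fields["followers_at_posting"]),
        views=raw.get(fields["views"]),
    )
```

Fields that a platform does not provide, such as views on non-video posts, simply stay empty, so gaps in data access remain visible in the combined dataset rather than being silently filled.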

There are still, however, some crucial data points on most platforms that are not available to researchers. One example is access to comments on posts, which only YouTube and Telegram provide. Access to comments on social media platforms is important for researchers, as it allows them to gather data on how disinformation is being shared, discussed, and received by social media users. Furthermore, comments can reveal patterns of user behaviour, such as whether the same individuals or groups are consistently spreading disinformation, and what tactics they are using to persuade others, which helps, for example, in detecting coordinated inauthentic behaviour. Comments can also be a rich source of information on discursive strategies used to promote disinformation and on societal aspects that make certain disruptive narratives more appealing to specific groups of users. For example, a recent study on the effectiveness of content moderation during the 2022 Italian general elections analysed user-generated comments on Facebook, Twitter, and YouTube, and found that hate speech and false news were prevalent in comments during the election campaigns.

TikTok's Insufficient Promises

Table 2 highlights that TikTok currently only allows developers vetted and approved by the social media platform to access the data described in the table. This presents a significant challenge for researchers who are interested in studying the platform and in understanding how disinformation is being shared and amplified. TikTok is a rapidly growing social media platform and is increasingly being used to disseminate false information. Without access to an API, researchers are unable to investigate how disinformation takes shape on a video-based platform and how user behaviours differ from those on other platforms. This lack of data also makes it challenging to develop new tools and strategies for combating disinformation and promoting digital literacy among TikTok users.

While TikTok has recently announced its intent to develop a researcher API to provide selective access to academics in the United States, the API's terms of service are not compatible with how researchers work. The requirement to refresh data every 15 days and delete data that is no longer available makes it challenging for researchers to conduct research on disinformation and hate speech effectively. Furthermore, researchers are obligated to report content that does not adhere to TikTok’s community guidelines, and if the content is deleted, researchers are required to delete it from their own dataset, rendering research on disinformation and hate speech almost impossible. Without access to an API and appropriate terms of service, it is difficult to hold TikTok accountable for its policies and efforts to control information manipulation on its platform.
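The effect of a refresh-and-delete condition of this kind on a research dataset can be sketched as follows. This is an illustration under stated assumptions, not TikTok's API or the actual wording of its terms: `refresh_dataset`, the record layout and the `is_still_available` callback are hypothetical, and only the 15-day interval comes from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Tuple

REFRESH_INTERVAL = timedelta(days=15)  # interval taken from the terms described above


def refresh_dataset(
    dataset: Dict[str, dict],
    is_still_available: Callable[[str], bool],  # hypothetical availability check
    now: datetime,
) -> Tuple[Dict[str, dict], List[str]]:
    """Apply a refresh-and-delete rule to a locally stored dataset.

    Records last checked at least REFRESH_INTERVAL ago are re-checked;
    those no longer available on the platform are dropped.
    Returns the surviving records and the ids that had to be deleted.
    """
    kept: Dict[str, dict] = {}
    deleted: List[str] = []
    for post_id, record in dataset.items():
        due = now - record["last_checked"] >= REFRESH_INTERVAL
        if due and not is_still_available(post_id):
            deleted.append(post_id)  # removed content leaves the dataset
            continue
        kept[post_id] = {**record, "last_checked": now} if due else record
    return kept, deleted


# Usage with made-up records: post "b" has been removed from the platform.
now = datetime(2023, 3, 23, tzinfo=timezone.utc)
dataset = {
    "a": {"text": "still online", "last_checked": now - timedelta(days=20)},
    "b": {"text": "later removed", "last_checked": now - timedelta(days=20)},
    "c": {"text": "recently checked", "last_checked": now - timedelta(days=2)},
}
kept, deleted = refresh_dataset(dataset, lambda post_id: post_id != "b", now)
```

In this example, the removed post "b" disappears from the dataset at the next refresh. Content taken down by moderation is therefore exactly the content a compliant researcher can no longer retain or analyse.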

Next Steps

Under the Code of Practice, platforms are obliged to report on their progress in fulfilling their commitments. In future activities on data access, we intend to monitor the implementation of the Code of Practice and developments under the DSA. We also plan to expand the database beyond VLOPs and to develop a scoring system assessing platforms' approaches to data access.
