Teresa Scassa - Blog

Displaying items by tag: data scraping

An interesting case from Quebec demonstrates the tension between privacy and transparency when it comes to public registers that include personal information. It also raises issues around ownership and control of data, including the measures used to prevent data scraping. The way the litigation was framed means that not all of these questions are answered in the decision, leaving some lingering public policy questions.

Quebec’s Enterprise Registrar oversees a registry, in the form of a database, of all businesses in Quebec, including corporations, sole proprietorships and partnerships. The Registrar is empowered to do so under the Act respecting the legal publicity of enterprises (ALPE), which also establishes the database. The Registrar is obliged to make this register publicly accessible, including remotely by technological means, and basic use of the database is free of charge.

The applicant in this case is OpenCorporates, a U.K.-based organization dedicated to ensuring total corporate transparency. According to its website, OpenCorporates has created and maintains “the largest open database of companies in the world”. It currently has data on companies located in over 130 jurisdictions. Most of this data is drawn from reliable public registries. In addition to providing a free, searchable public resource, OpenCorporates also sells structured data to financial institutions, government agencies, journalists and other businesses. The money raised from these sales finances its operations.

OpenCorporates gathers its data using a variety of means. In 2012, it began to scrape data from Quebec’s Enterprise Register. Data scraping involves the use of ‘bots’ to visit and automatically harvest data from targeted web pages. It is a common data-harvesting practice, widely used by journalists, civil society actors and researchers, as well as companies large and small. As common as it may be, it is not always welcome, and there has been litigation in Canada and around the world about the legality of data scraping practices, chiefly in contexts where the defendant is attempting to commercialize data scraped from a business rival.

In 2016 the Registrar changed the terms of service for the Enterprise Register. These changes essentially prohibited web scraping activities, as well as the commercialization of data extracted from the site. The new terms also prohibit certain types of information analyses; for example, they bar searches for data according to the name and address of a particular person. All visitors to the site must agree to the Terms of Service. The Registrar also introduced technological measures to make it more difficult for bots to scrape its data.

Opencorporates Ltd. c. Registraire des entreprises du Québec is not a challenge to the Register’s new, restrictive terms and conditions. Instead, because the Registrar also sent OpenCorporates a cease and desist letter demanding that it stop using the data it had collected prior to the change in Terms of Service, OpenCorporates sought a declaration from the Quebec Superior Court that it was entitled to continue to use this earlier data.

The Registrar acknowledged that nothing in the ALPE authorizes it to control uses made of any data obtained from its site. Further, until it posted the new terms and conditions for the site, nothing limited what users could do with the data. The Registrar argued that it had the right to control the pre-2016 data because of the purpose of the Register. It argued that the ALPE established the Register as the sole source of public data on Quebec businesses, and that the database was designed to protect the personal information that it contained (i.e. the names and addresses of directors of corporations). For example, it does not permit extensive searches by name or address. OpenCorporates, by contrast, permits the searching of all of its data, including by name and address.

The court characterized the purpose of the Register as being to protect individuals and corporations that interact with other corporations by assuring them easy access to identity information, including the names of those persons associated with a corporation. An electronic database gives users the ability to make quick searches from a distance. Quebec’s Act to Establish a Legal Framework for Information Technology provides that where a document contains personal information and is made public for particular purposes, any extensive searches of the document must be limited to those purposes. This law places the onus on the person responsible for providing access to the document to put in place appropriate technological protection measures. Under the ALPE, the Registrar can carry out more comprehensive searches of the database on behalf of users, who must make their request to the Registrar. Even then, the ALPE prohibits the Registrar from using the name or address of an individual as a basis for a search. According to the Registrar, a member of the public has the right to know, once they have the name of a company, with whom they are dealing; they do not have the right to determine the number of companies to which a physical person is linked. By contrast, this latter type of search is one that could be carried out using the OpenCorporates database.

The court noted that it was not its role to consider the legality of OpenCorporates’ database, nor to consider the use made by others of that database. It also observed that individuals concerned about potential privacy breaches facilitated by OpenCorporates might have recourse under Quebec privacy law. Justice Rogers’ focus was on the specific question of whether the Registrar could prevent OpenCorporates from using the data it gathered prior to the change of terms of service in 2016. On this point, the judge ruled in favour of OpenCorporates. In her view, OpenCorporates’ gathering of this data was not in breach of any law that the Registrar could rely upon (leaving aside any potential privacy claims by individuals whose data was scraped). Further, she found that nothing in the ALPE gave the Registrar a monopoly on the creation and maintenance of a database of corporate data. She observed that the use made by OpenCorporates of the data was not contrary to the purpose of the ALPE, which was to create greater corporate transparency and to protect those who interacted with corporations. She ruled that nothing in the ALPE obligated the Registrar to eliminate all privacy risks. The names and addresses of those involved with corporations are public information; the goal of the legislation is to facilitate digital access to the data while at the same time placing limits on bulk searches. Nothing in the ALPE prevented another organization from creating its own database of Quebec businesses. Since OpenCorporates did not breach any laws or terms of service in collecting the information between 2012 and 2016, nothing prevented it from continuing to use that information in its own databases. Justice Rogers issued a declaration to the effect that the Registrar was not permitted to prevent OpenCorporates from publishing and distributing the data it collected from the Register prior to 2016.

While this was a victory for OpenCorporates, it did not do much more than ensure its right to continue to use data that will become increasingly dated. There is perhaps some value in the Court’s finding that the existence of a public database does not, on its own, preclude the creation of derivative databases. However, the decision leaves some important questions unanswered. In the first place, it alludes to but offers no opinion on the ability to challenge the inclusion of the data in the OpenCorporates database on privacy grounds. While a breach of privacy argument might be difficult to maintain in the case of public data regarding corporate ownership, it is still unpredictable how it might play out in court. This is far less sensitive data than that involved in the scraping of court decisions litigated before the Federal Court in A.T. v. Globe24hr.com; there is a public interest in making the specific personal information available in the Registry; and the use made by OpenCorporates is far less exploitative than in Globe24hr. Nevertheless, the privacy issues remain a latent difficulty. Overall, the decision tells us little about how to strike an appropriate balance between the values of transparency and privacy. The legislation and the Registrar’s approach are designed to make it difficult to track corporate ownership or involvement across multiple corporations. The result is rigorous protection of information that has low privacy value and a strong public dimension, with transparency being weakened as a result. It is worth noting that another lawsuit against the Register may be in the works. It is reported that the CBC is challenging the decision of the Registrar to prohibit searches by names of directors and managers of companies as a breach of the right to freedom of expression.

Because the terms of service were not directly at issue in the case, there is also little to go on with respect to the impact of such terms. To what extent can terms of service limit what can be done with publicly accessible data made available over the Internet? The recent U.S. case of hiQ Labs Inc. v. LinkedIn Corp. raises interesting questions about freedom of expression and the right to harvest publicly accessible data. This and other important issues remain unaddressed in what is ultimately an interesting but unsatisfying court decision.

Published in Privacy

Last year I attended a terrific workshop at UBC’s Allard School of Law. The workshop was titled ‘Property in the City’, and panelists presented work on a broad range of issues relating to law in the urban environment. A special issue of the UBC Law Review has just been published featuring some of the output of this workshop. The issue contains my own paper (discussed below and available here) that explores skirmishes over access to and use of Airbnb platform data.

Airbnb is a ‘sharing economy’ platform that facilitates the booking of short-term accommodation. The company is premised on the idea that many urban dwellers have excess space – rooms in homes or apartments – or have space they do not use at certain periods of the year (entire homes or apartments while on vacation, for example) – and that a digital marketplace can maximize efficient use of this space by matching those seeking temporary accommodation with those having excess space. The Airbnb web site claims that it “connects people to unique travel experiences at any price point” and at the same time “is the easiest way for people to monetize their extra space and showcase it to an audience of millions.”

This characterization of Airbnb is open to challenge. Several studies, including ones by the Canadian Centre for Policy Alternatives, the City of Vancouver, and the NY State Attorney General, suggest that a significant number of units for rent on Airbnb are offered as part of commercial enterprises. The description also belies Airbnb’s disruptive impact. The re-characterization and commodification of ‘surplus’ private spaces neatly evades the regulatory frameworks designed for the marketing of short-term accommodation and leaves licensed short-term accommodation providers complaining that their highly regulated businesses are being undermined by competition from those not bearing the same regulatory burdens. At the same time, many housing advocates and city officials are concerned about the impact of platforms such as Airbnb on the availability and affordability of long-term housing.

These challenges are made more difficult to address by the fact that the data needed to understand the impact of platform companies, along with data about short-term rentals that would otherwise be captured through regulatory processes, are effectively privatized in the hands of Airbnb. Data deficits of this kind pose a challenge to governments, civil society and researchers.

My paper explores the impact of a company such as Airbnb on cities from the perspective of data. I argue that platform-based, short-term rental activities have a fundamental impact on what data are available to municipal governments who struggle to regulate in the public interest, as well as to civil society groups and researchers that attempt to understand urban housing issues. The impacts of platform companies are therefore not just disruptive of incumbent industries; they disrupt planning and regulatory processes by masking activities and creating data deficits. My paper considers some of the currently available solutions to the data deficits, which range from self-help type recourses such as data scraping to entering into data-sharing agreements with the platform companies. Each of these solutions has its limits and drawbacks. I argue that further action may be required by governments to ensure their data needs are adequately met.

Although this paper focuses on Airbnb, it is worth noting that the data deficits discussed in the paper are merely a part of a larger context in which evolving technologies shift control over some kinds of data from public to private hands. Ensuring the ability of governments and civil society to collect, retain, and share data of a sufficient quality both to enable and to enhance governance, transparency, and accountability should be a priority for municipal governments, and should also be supported by law and policy at provincial and federal levels.

Skirmishes over the right to freely access and use “publicly available” data hosted by internet platform companies have led to an interesting decision from the U.S. District Court for the Northern District of California. The decision is on a motion for an interlocutory injunction, so it does not decide the merits of the competing claims. Nevertheless, it provides insight into a set of issues that are likely only to increase in importance as these rich troves of data are mined by competitors, opportunistic businesses, big data giants, researchers and civil society actors.

The parties in hiQ Labs Inc. v. LinkedIn Corp. are companies whose business models are based upon career-related personal information provided by professionals. LinkedIn offers a professional networking platform to over 500 million users, and it is easily the leading company in its space. hiQ, for its part, is a data analytics company with two main products aimed at enterprises. The first is “Keeper”, a product which informs corporations about which of their employees are at greatest risk of being poached by other companies. The second is “Skill Mapper”, which provides businesses with summaries of the skills of their employees. For both of its products hiQ relies on data that it scrapes from LinkedIn’s publicly accessible web pages.

Data featured on LinkedIn’s site are provided by users who create accounts and populate their profiles with a broad range of information about their background and skills. LinkedIn members have some control over the extent to which their information will be shared by others. They can choose to limit access to their profile information to only their close contacts or to an expanded list of contacts. Alternatively, they can provide access to all other members of LinkedIn. They also have the option to make their profiles entirely public. These public profiles are searchable by search engines such as Google. It is the data in the fully public profiles that is scraped and used by hiQ.

hiQ is not the only company that scrapes data from LinkedIn as part of an independent business model. In fact, LinkedIn has only recently attempted to take legal action against a large number of users of its data. hiQ was just one of many companies that received a cease and desist letter from LinkedIn. Because being cut off from the LinkedIn data would effectively decimate its business, hiQ responded by seeking a declaration from the California court that its activities were legal. The recent decision from the court is in relation to hiQ’s request for an interlocutory injunction that will allow it to continue to access the LinkedIn data pending resolution of the substantive legal issues raised by both sides.

hiQ argued that in moving against its data scraping activities, LinkedIn engaged in unfair business practices and violated its free speech rights under the California constitution. LinkedIn, for its part, argued that hiQ’s data scraping activities violated the Computer Fraud and Abuse Act (CFAA), as well as the digital locks provisions of the Digital Millennium Copyright Act (DMCA) (although these latter claims do not feature in the decision on the interlocutory injunction).

As with other platform companies, access to and use of LinkedIn’s site is governed by website Terms of Service (TOS). These TOS prohibit data scraping. When LinkedIn demanded that hiQ cease scraping data from its site, it also implemented technological protection measures to prevent access by hiQ to its data. LinkedIn’s claims under the CFAA and the DMCA are based largely on the circumvention of these technological barriers by hiQ.

The court ultimately granted the injunction barring LinkedIn from limiting hiQ’s access to its publicly available data pending the resolution of the issues in the case. In doing so, it expressed its doubts that the CFAA applied to hiQ’s activity, noting that if it did, it would “profoundly impact open access to the Internet.” It also found that attempts by LinkedIn to block hiQ’s access might be in breach of state law as anti-competitive behavior. In reaching its decision, the court had some interesting things to say about the importance of access to publicly accessible data, and the privacy rights of those who provided the data. These issues are highlighted in the discussion below.

In deciding whether to grant an interlocutory injunction, a court must assess both the possibility of irreparable harm and the balance of convenience as between the parties. In this case, the court found that denying hiQ access to LinkedIn data would essentially put it out of business, causing it irreparable harm. LinkedIn argued that it was imperative that it be allowed to protect its data because of its users’ privacy interests. While hiQ only scraped data from public profiles, LinkedIn argued that even those users with public profiles had privacy interests. It noted that 50 million of its users with public profiles had selected its “Do Not Broadcast” feature, which prevents profile updates from being broadcast to a user’s connections. LinkedIn described this as a privacy feature that would essentially be circumvented by routine data scraping. The court was not convinced. In the first place, it found that there might be many reasons besides privacy concerns that motivated users to choose “Do Not Broadcast”. It gave as an example the concern by users that their connections not be spammed by endless notifications. The court also noted that LinkedIn had its own service for professional recruiters that kept them apprised of updates even from users who had implemented “Do Not Broadcast”. The court dismissed arguments by LinkedIn that this was different because users had consented to such sharing in their privacy policy. The court stated: “It is unlikely, however, that most users’ actual privacy expectations are shaped by the fine print of a privacy policy buried in the User Agreement that likely few, if any, users have actually read.” [Emphasis in original] This is interesting, because the court discounts the relevance of a privacy policy in informing users’ expectations of privacy. Essentially, the court finds that users who make their profiles public have no real expectation of privacy in the information. LinkedIn could therefore not rely on its users’ privacy interests to justify its actions.

In assessing whether the parties raised serious questions going to the merits of the case, the court considered LinkedIn’s arguments about the CFAA. The CFAA essentially criminalizes intentional access to a computer without authorization, or in a way that exceeds the authorization provided, with the result that information is obtained. The question, therefore, was whether hiQ’s continued access to the LinkedIn site, after LinkedIn expressly revoked any permission and tried to bar its access, was a violation of the CFAA. The court dismissed the cases cited by LinkedIn in support of its position, noting that these cases involved unauthorized access to password-protected sites as opposed to accessing publicly available information.

The court observed that the CFAA was enacted largely to deal with the problem of computer hacking. It noted that if the application of the law was extended to publicly accessible websites it would greatly expand the scope of the legislation with serious consequences. The court noted that this would mean that “merely viewing a website in contravention of a unilateral directive from a private company would be a crime.” [Emphasis in original] It went on to note that “The potential for such exercise of power over access to publicly viewable information by a private entity weaponized by the potential of criminal sanctions is deeply concerning.” The court placed great emphasis on the importance of an open internet. It noted that “LinkedIn, here, essentially seeks to prohibit hiQ from viewing a sign publicly visible to all”. It clearly preferred an interpretation of the CFAA that would be limited to unauthorized access to a computer system through some form of “authentication gateway”.

The court also found that hiQ raised serious questions as to whether LinkedIn’s behavior might run afoul of competition laws in California. It noted that LinkedIn is in a dominant position in the field of professional networking, and that it might be leveraging its position to get a “competitively unjustified advantage in a different market.” It also accepted that it was possible that LinkedIn was denying its competitors access to an essential facility that it controls.

The court was not convinced by hiQ’s arguments that the technological barriers erected by LinkedIn violated the free speech guarantees in the California Constitution. Nevertheless, it found that on balance the public interest favoured the granting of the injunction to hiQ pending the outcome of litigation on the merits.

This dispute is extremely interesting and worth following. There are a growing number of platforms that host vast stores of publicly accessible data, and these data are often relied upon by upstart businesses (as well as established big data companies, researchers, and civil society) for a broad range of purposes. The extent to which a platform company can control its publicly accessible data is an important question, and one which, as the California court points out, will have important public policy ramifications. The related privacy issues, which arise where the data is also personal information, are also important and interesting. These latter issues may be treated differently in different jurisdictions depending upon the applicable data protection laws.

Published in Privacy