
Termite Part One: A Response to Privacy Conundrums

A data science solution for the privacy policies that nobody actually reads

Why Privacy?

"If you have nothing to hide, you shouldn’t worry." This often-heard argument is a fallacious attempt to justify and rationalize the compromising of one’s privacy. Further, I think it is bystander logic such as this that enables the powerful to incrementally encroach on your rights, until it’s too late.

In the final term of UC Berkeley School of Information’s Master of Information and Data Science (MIDS) program, students are tasked with building a Minimum Viable Product (MVP). Entering this final phase of graduate school, I felt optimistic about the opportunity to use the data science skills I had developed over the past year to advance my understanding of a worthy cause. The cause my team (Jennifer Patterson, Julian Pelzner, and Ollie Downs) and I chose to pursue was internet privacy.

My foray into privacy was fueled by my simultaneous employment with the Institute for Security and Technology. While conducting preliminary research on how digital platforms influence human cognition, and how that in turn impacts democratic institutions, I found myself constantly wondering why people trust online platforms so much. Conversely, why should we be skeptical about online forums? (Disclaimer: the products and plug-ins designed by me that are mentioned here are not affiliated with the Institute for Security and Technology, or any of its projects or programs, in any way.) While I remain unable to fully answer these questions, my capstone project gave me an opportunity to grapple with these issues.

In the midst of the global COVID-19 pandemic, we are spending more time online than ever before. Accordingly, data collection and monitoring are increasing as well. Debates on the tradeoffs of contact tracing and surveillance are unfolding before our eyes. On one side, contact tracing may help curtail the spread of the virus. On the other, how will governments use the same surveillance technology after we overcome the pandemic? Do we really believe governments will relinquish their privilege to use these advanced tools? Who is to say that institutions of power will not turn around and use similar technology to expel immigrants or to identify and harass citizens exercising their First Amendment right to protest?

My 11th grade history teacher, Andy Owens, used to say, "privilege is never given up voluntarily." I think this continues to ring true in these uncertain times. The discussion of online privacy could not be more relevant, and it isn’t a new problem. For a more in-depth discussion of privacy risk aversion and privacy behavior, read It’s Safe with Us! The Science of Consumer Privacy by Alisa Frik and Matilda Ruck. Here are several intriguing findings mentioned in that article:

  1. Six in ten American adults feel it’s not possible to go through daily life without having their data collected, and 81% think that the potential risks outweigh the benefits.
  2. 69% of people globally say they’d avoid doing business with a brand if its data usage was too invasive.
  3. 78% of people across the UK, US, France, and Germany say they’re protective of their financial information, compared to just 57% who say the same of their contact information.
  4. Just 9% of Americans always read a privacy policy before agreeing to the Terms and Conditions (T&Cs), and 47% of people have expressed notification fatigue from messaging around the European Union’s General Data Protection Regulation (GDPR).
  5. 72% of Americans feel that almost all or most of what they do online or on their phone is being tracked by advertisers and tech firms.
  6. 48% of consumers who care about privacy have switched companies or service providers because of their data policies and practices.
  7. 14% of Britons would be willing to share data if they were paid for it. But these pay-offs can also be circumstantial. Research from the University of York suggests that 54% of Britons are willing to sacrifice some of their data privacy to shorten the length of the lockdown. Moreover, 73% of people would be open to sharing data if a company was more transparent about how they use it.

We corroborated these statistics with our own user research, which revealed that 50% of respondents reported that they never read a website’s privacy policy, and none of our respondents said that they always read a policy. Additionally, when asked if they feel responsible on the internet, 67.2% of respondents said "no" or "I don’t know." Not only does this show us that internet users don’t feel that their safety needs are being met, it also shows us that many users just don’t know if their information is safe or not.

The problem we seek to address has to do with privacy policies, which ideally should be the place for users to feel reassured about the safe data collection practices of a company. However, this is rarely the case. The primary issue with privacy policies is that they are too long and filled with technical jargon and "legalese." Privacy policies are often written by lawyers seeking to insulate companies from litigation. They are rarely written to protect user interests and promote their awareness. More recently, it has been increasingly fashionable from a public relations standpoint to construct more transparent privacy policies. Nevertheless, these are fringe cases, and privacy policies remain difficult for the average internet user to read, understand, summarize, and analyze. The truth is that most people don’t read them, and unwittingly agree.

Termite

This is why we’re launching Termite: a free browser add-on that uses web scraping and NLP approaches to automate and scale the assessment and rating of online terms of service, terms and contracts, and privacy policies. Termite also provides users with customized cyber hygiene reports and empowers them to track their online agreements, all in real time. We aim to optimize UI, awareness, and actionability. When privacy policies and terms and conditions are unreadable, Termite rates and tracks them. When users feel uncertain about their information safety, Termite empowers them. When websites take advantage of users not reading their policies, Termite lets users know, protecting them. Click here to view our brief demo!
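To make the web scraping step concrete, here is a minimal sketch of how an add-on's backend might locate policy links on a page before any NLP is applied. It is not Termite's actual implementation: the `POLICY_LINK` pattern and the `PolicyLinkFinder` and `find_policy_links` names are assumptions made for illustration, and only Python's standard library is used.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: link text or URLs that usually point at legal documents.
POLICY_LINK = re.compile(r"privacy|terms", re.IGNORECASE)


class PolicyLinkFinder(HTMLParser):
    """Collect (link text, href) pairs for anchors that look like policy links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            if POLICY_LINK.search(text) or POLICY_LINK.search(self._href):
                self.links.append((text, self._href))
            self._href = None


def find_policy_links(html):
    """Return every anchor in the page whose text or URL mentions a policy."""
    finder = PolicyLinkFinder()
    finder.feed(html)
    return finder.links
```

Calling `find_policy_links` on a page's HTML returns the candidate links whose targets could then be fetched and passed to a rating model. A keyword match like this is deliberately crude; a real tool would need to handle sites that label their policies differently.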

Logo by Chris Rivas (@the_hidden_talents)

There are some tools and services out there, each serving its own niche, and each does a great job within its respective problem space. As a whole, however, the landscape remains imperfect. What is missing is a philosophical bridge between these disparate efforts, which leaves an opportunity to further connect the dots. Below are some existing solutions to our current privacy problems, and their respective shortcomings:

Mozilla Firefox

Mozilla Firefox is an open-source web browser with a demonstrated privacy-first ethos, and is the second largest browser by market share. In 2018, the combined income of Mozilla Foundation and Mozilla Corporation was $436 million. Firefox currently represents 7.58% of the browser market. By comparison, Google Chrome is the largest browser, and currently represents 68.81% of the browser market. Mozilla is an example of a company that practices what it preaches when it comes to privacy.

DuckDuckGo

DuckDuckGo is a search engine that automatically encrypts a user’s connection to a website when possible (so that your personal information can’t be collected or compromised), blocks ad trackers, and warns users about bad privacy practices. It is ranked the sixth largest search engine by market share, with over 50 million users. In 2018, it was valued at between $10M and $50M, and it currently represents 1.36% of the search engine market. By comparison, Google is the largest search engine, and currently represents 70.37% of the market. DuckDuckGo represents the most comprehensive and successful solution out there.

Source: https://duckduckgo.com/traffic

Terms of Service; Didn’t Read (ToS;DR)

Terms of Service; Didn’t Read (ToS;DR) was formed to crowdsource online privacy policy analyses and privacy grades. It merits recognition as the best attempt to educate users about their online privacy. A fair criticism of ToS;DR, though, is that by crowdsourcing policy analyses and ratings, their website scores lack consistency. Moreover, their index of privacy topics spans 24 categories, far too many to be helpful and user-friendly. Further, their analyses are user-generated, meaning manually conducted by humans, resulting in a limited reach of site analyses.

Source: https://tosdr.org/topics.html#topics

Polisis

Polisis is a Chrome extension that provides users with an AI-powered summary of any privacy policy. A unique way of visualizing privacy policies, Polisis utilizes deep learning and artificial intelligence to educate users on what data a company is collecting about them, what it is sharing, and much more. You don’t have to read the full privacy policy with all the legal jargon to understand what you are signing up for. In its creators’ words, Polisis "enables scalable, dynamic, and multi-dimensional queries on natural language privacy policies. At the core of Polisis is a privacy-centric language model, built with 130K privacy policies, and a novel hierarchy of neural-network classifiers that accounts for both high-level aspects and fine-grained details of privacy practices." Polisis has 1000+ users according to the Chrome Extension Store. It breaks privacy down into the following 10 categories: 1) First Party Collection, 2) Third Party Collection, 3) Access, Edit, Delete, 4) Data Retention, 5) Data Security, 6) Specific Audiences, 7) Do Not Track, 8) Policy Change, 9) Other, and 10) Choice Control. Although Polisis is a neat tool that employs very intriguing data visualizations, my experience has been that it is slow and not all that user-friendly.

Privee

Privee is a proof-of-concept browser extension that seeks to automate the analysis of privacy policies. Its creators boast that they have analyzed 38,522 websites and counting. Privee has not been worked on since 2014, and has only 42 users on the Google Chrome extension store. To train their models, they relied on ToS;DR policy excerpt annotations. To develop and test their models, they used regular expressions to extract appropriate policy excerpts from new websites. They cover the following six privacy categories: 1) Collection, 2) Encryption, 3) Ad Tracking, 4) Limited Retention, 5) Profiling, and 6) Ad Disclosure.
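As a rough illustration of regex-based excerpt extraction, the sketch below tags each sentence of a policy with the categories whose pattern it matches. Privee's actual expressions are not given here, so every pattern in `CATEGORY_PATTERNS`, along with the `split_sentences` and `extract_excerpts` helpers, is a hypothetical stand-in; only the six category names come from the description above.

```python
import re

# Hypothetical keyword patterns, one per category. These are illustrative
# guesses, not the expressions Privee actually uses.
CATEGORY_PATTERNS = {
    "Collection": re.compile(r"\bcollect(s|ed|ion)?\b", re.IGNORECASE),
    "Encryption": re.compile(r"\b(encrypt\w*|SSL|TLS)\b", re.IGNORECASE),
    "Ad Tracking": re.compile(r"\b(cookies?|track\w*|web beacons?)\b", re.IGNORECASE),
    "Limited Retention": re.compile(r"\b(retain\w*|retention)\b", re.IGNORECASE),
    "Profiling": re.compile(r"\bprofil\w*\b", re.IGNORECASE),
    "Ad Disclosure": re.compile(r"\b(advertis\w*|third[- ]part(y|ies))\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_excerpts(policy_text):
    """Map each category to the policy sentences that match its pattern."""
    excerpts = {name: [] for name in CATEGORY_PATTERNS}
    for sentence in split_sentences(policy_text):
        for name, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(sentence):
                excerpts[name].append(sentence)
    return excerpts
```

The extracted sentences could then be handed to a per-category classifier. Keyword patterns like these are brittle: they miss paraphrases and match negated statements such as "we do not collect", which is one reason a downstream model is still needed.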

The innovation of tools such as DuckDuckGo and the relative success of crowdsourcing efforts (both financially and in the form of analytical services) mark the beginning of a trend. There is a growing demand for companies to design with privacy-minded principles. In the absence of sufficiently far-reaching regulation and compliance, academia, the private sector, and civil society have demonstrated that they are willing to chip away at the boulder that is the privacy problem. Nonetheless, more progress needs to be made in areas that have not yet been adequately covered.

Image by author

This post kicks off a five-part series. In "Termite Part Two: Model and Feature Choices," I discuss the dilemmas our team faced and the decisions we ultimately made. Access parts three, four, and five in the hyperlinks.

