SnooKnows
Predict a Reddit user's vulnerability to harassment
Introduction
Privacy concerns on mainstream social networks have driven the popularity of anonymous networks like Reddit, Whisper and Secret. But many past incidents suggest that this blanket of protection may be little more than a puff of smoke. Because anonymous users tend to be more "candid", they still leave footprints that predators can follow. These traces help ill-willed individuals harass other users in these communities with whatever information they can dig up.
To raise awareness among users of one such network, Reddit, we conceived SnooKnows (named after Reddit’s extraterrestrial mascot, Snoo). It is a simple web tool that lets Reddit users quantify their vulnerability and preemptively build a defence against harassers.
My Role
This project was part of one of my undergraduate courses, Privacy and Security in Online Social Media. It accounted for a large share of the course grade and took us the entire semester to build. The arc stretched from finding a real problem to building a solution. Team roles were fluid at first, but we settled into specialised roles once the problem became less nebulous.
The Problem
By the late 2000s and early 2010s, internet users were flooded with social networks and online communities, each trying to offer something unique. While Facebook is arguably the most popular network, many anonymous networks started bubbling up in the later years of this boom. Such networks thrive on their ability to protect a user's privacy and identity. Take Reddit, the network we focused on in this project: it doesn't even need an email address to sign up. Just think of a username and password and you're good to go.
Many of these networks don't need photos, "about me" sections, or anything of that sort. Without such artifacts, one would assume that a user's online personality is separate from their offline, or real, personality. The reasoning is that since users are not obliged to present their offline selves (for verification, validation or otherwise), they are merely playing make-believe in these communities.
However, the truth is that for many users, these anonymous networks are where they can be themselves without fear of persecution. The shield of anonymity these networks provide allows these supposedly elusive individuals to tell the truth without fear. And this candour often causes users to slip and give away valuable details.
As David Carr of The New York Times recently wrote, the anonymity of Reddit “has the odd effect of prompting users to be very intimate and remarkably candid.” Agreed. But it is not mere anonymity that makes Reddit work so well.
Quite a few Reddit loyalists would claim that the real spine of the community, and what maintains order within the chaos of the web, is the rules of Reddit, known within the community as "reddiquette".
Yet another nuance of Reddit is that these rules are localised to individual subreddits (or discussion boards) and strictly adhered to. An example is the popular r/nosleep, where nobody questions the legitimacy of the horror stories posted. This mentality has led to the creation of hate subreddits and other subreddits that clearly thrive on character assassination. Many of these have been banned, but new ones can be formed just as quickly. Given that users often turn to Reddit for help and expose their problems to a community, they put themselves at risk of negative attention from these fringe elements. This can lead to harassment of users.
Among community members, one of the most infamous incidents of a user being tracked from their virtual profile to their real one was that of u/violentacrez. This man was dubbed "The Biggest Troll on the Web" and was known to moderate some extremely controversial subreddits.
Though this is a case of investigative journalists tracking down a nefarious element on an anonymous network, there are other, more low-key incidents too. Many subreddits are dedicated to users talking about real incidents or what they are like in real life. These discussions attract harassment. One would expect Reddit to downvote such comments into oblivion and protect the interests of the person in question. But the truth is that users can still be threatened via private messages, which are hidden from the community. There have been incidents of people leaving Reddit due to this feeling of insecurity.
Simply put, this endangers the experience of users on Reddit and can lead to them quitting the website altogether. This not only hurts Reddit's user retention but also leaves a hole in the space a site like Reddit could fill in people's lives.
The Solution
To tackle this problem, we developed a web tool that pulls all the public data of a given Reddit user and analyses it. Natural Language Processing and Machine Learning techniques are applied to extract sensitive information from this bulk of data. Ultimately, the user is presented with a classification of how sensitive their data is, based on the kinds of features extracted.
Our Process
We floated a survey within our undergraduate institute to collect Reddit usernames and obtain some ground truth. However, the response was understandably very low, as most users are concerned about their privacy. As a result, we manually handpicked a few hundred active Reddit handles from various popular subreddits. Only content creators were selected, rather than simple spectators, because we planned to analyse the textual content users posted as either fresh posts or comments. Since Reddit does not provide access to what users upvote, we could not learn anything about users from that information.
Once we had prepared a list, we planned to analyse the public information of these users and then follow up with them to validate our estimates. The initial analysis of the collected user profiles was done with an online Reddit analytics tool called SnoopSnoo. The results were then sent to the users as private messages, and they were asked to fill in a form to verify the accuracy and authenticity of the results. These verified results were then used as ground truth for SnooKnows.
Note: At this point, users were asked to provide consent for us to continue using their publicly available data.
Now it was time to analyse the user data itself. We employed a third-party Natural Language Processing API, TextRazor, to extract various features from users' publicly available comments. The API extracts things like Entities and Relationships from the given text, which are then used to derive attributes such as the things a user likes to discuss, the user’s relationships and the people in their family. The API also gives a broad list of topics for the text, based on a “contextual score”.
The results from our data were compared with the ground truth to gauge the accuracy of the tool we were building.
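The comparison against ground truth can be sketched roughly as below. This is only an illustration, not our actual evaluation code: the function name, the dictionary layout and the case-insensitive matching rule are all assumptions made for the example.

```python
def inference_accuracy(inferred, confirmed):
    """Return the fraction of user-confirmed fields that were inferred correctly.

    `inferred` and `confirmed` are hypothetical dicts mapping a field name
    (e.g. "location") to a string value. Only fields present in both are compared.
    """
    compared = [field for field in confirmed if field in inferred]
    if not compared:
        return 0.0
    matches = sum(
        1
        for field in compared
        # Assumed matching rule: ignore case and surrounding whitespace.
        if inferred[field].strip().lower() == confirmed[field].strip().lower()
    )
    return matches / len(compared)


# Made-up example values, not real user data.
inferred = {"location": "Delhi", "occupation": "student", "pet": "dog"}
confirmed = {"location": "delhi ", "occupation": "engineer"}
accuracy = inference_accuracy(inferred, confirmed)
```

Here only the two fields the user confirmed are compared, so one correct value out of two gives an accuracy of one half.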
Another survey was floated to determine what kinds of data, if gathered about anonymous users, could be sensitive and therefore potentially used to harass them. This survey was sent to students in our institute and to the Reddit community. The responses were then combined for analysis and used to come up with a vulnerability score for a Reddit user.
The most important information to present to users is their potential for being harassed on Reddit, as indicated by the score they get. We also decided to provide users with links to the sensitive information they posted on Reddit, so that they can trace it back and modify the content.
Functional Prototype
Any Reddit user can use this tool to understand their vulnerability. Due to latency issues, it currently fetches only the user's top 100 comments. However, this can be extended to larger batches as the processing stage becomes faster.
The user is presented with a vulnerability score that is calculated from the number of sensitive fields the tool could infer. A list of all these fields and their inferred values is also presented. Certain objective fields also have external links that a user can follow back to the actual comment where that piece of information was revealed.
This score is based on a prior study in which respondents rated, on a scale of 1-10, how vulnerable they would feel if someone learned one of these fields about them.
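One way such a score could be computed is sketched below. The field names, the ratings and the normalisation to a 0-100 scale are assumptions made for illustration. The sketch weights each inferred field by its mean survey rating, which is one plausible reading of the description above and not necessarily the exact formula the tool uses.

```python
from statistics import mean

# Hypothetical survey data: the 1-10 ratings respondents gave each field.
SURVEY_RATINGS = {
    "location": [8, 9, 7],
    "family_members": [6, 7, 8],
    "occupation": [4, 5, 3],
    "hobbies": [2, 1, 3],
}


def field_weights(ratings):
    """Average the survey ratings to get one sensitivity weight per field."""
    return {field: mean(values) for field, values in ratings.items()}


def vulnerability_score(inferred_fields, ratings=SURVEY_RATINGS):
    """Score from 0 to 100: weight of the inferred fields over the weight of all fields."""
    weights = field_weights(ratings)
    total = sum(weights.values())
    exposed = sum(weights[field] for field in inferred_fields if field in weights)
    return round(100 * exposed / total, 1)


score = vulnerability_score({"location", "hobbies"})
```

With these made-up ratings, a user whose location and hobbies were inferred scores a little under half of the maximum, because location carries most of that weight.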
Comments are the source of every inference this tool makes. On Reddit, every comment belongs to a post, which in turn belongs to a subreddit. Subreddits are essentially topics for discussion.
A distribution showing where the most information is revealed can urge the user to modify their activity on that subreddit. More importantly, they can be more cautious when interacting with members of that community.
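The per-subreddit distribution amounts to a simple tally, sketched below. The record layout (a list of dictionaries with "field" and "subreddit" keys) and the sample entries are hypothetical. The tool's internal representation may differ.

```python
from collections import Counter


def subreddit_distribution(inferences):
    """Return the percentage of inferred items revealed in each subreddit.

    `inferences` is a hypothetical list of dicts, one per inferred item,
    each recording the subreddit of the comment it came from.
    """
    counts = Counter(item["subreddit"] for item in inferences)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders subreddits from most to least revealing.
    return {sub: round(100 * n / total, 1) for sub, n in counts.most_common()}


# Made-up sample data for illustration.
sample = [
    {"field": "location", "subreddit": "r/AskReddit"},
    {"field": "occupation", "subreddit": "r/AskReddit"},
    {"field": "family_members", "subreddit": "r/relationships"},
    {"field": "hobbies", "subreddit": "r/gaming"},
]
distribution = subreddit_distribution(sample)
```

In this sample, half of the inferred items come from r/AskReddit, so that is where the user would be nudged to take the most care.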
Once features have been extracted, the tool computes the possible combinations of features an attacker might learn. In a prior study, users rated which sets of features, in the hands of a predator, would make them feel vulnerable.
We calculate the possible sets a predator can learn for this particular user, and then present, for each such combination, the percentage of respondents who said it would make them feel vulnerable.
The best way to read a result is: “50% of people feel vulnerable only when their family members and location are known.”
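The combination analysis can be sketched with the standard library as below. The survey table, its percentages and the field names are invented for illustration, and the lookup assumes the study recorded one percentage per exact set of fields.

```python
from itertools import combinations

# Hypothetical survey results: percentage of respondents who said they would
# feel vulnerable if exactly this set of fields were known about them.
SURVEY_COMBINATIONS = {
    frozenset({"family_members", "location"}): 50,
    frozenset({"location", "occupation"}): 30,
    frozenset({"family_members", "location", "occupation"}): 80,
}


def exposed_combinations(inferred_fields, survey=SURVEY_COMBINATIONS):
    """List every surveyed combination that can be built from the inferred fields."""
    fields = sorted(inferred_fields)
    results = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            key = frozenset(combo)
            if key in survey:
                results.append((combo, survey[key]))
    # Show the most worrying combinations first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


report = exposed_combinations({"location", "family_members"})
```

Enumerating every subset grows exponentially with the number of inferred fields. That is acceptable for the handful of fields considered here, but a longer field list would be better served by checking each surveyed set for membership directly.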
Future
This tool is just a prototype and could do with a lot of improvements.
Naturally, a Reddit user should be allowed to check only their own vulnerability. Without this restriction, the tool becomes a weapon for the very predators we are trying to protect users from.
The studies determined only the sensitivity of each field, not of its value. The overall results indicate that sexuality is not a very sensitive feature. However, this may be biased because the sample consisted of heterosexual individuals.
Currently this is a defence tool, but its usefulness in offensive measures needs to be explored. It might have a future in investigating and hunting down trolls.
Beyond all this, there needs to be much more user testing for the tool and interface itself. The information visualisation may not be the best means of communicating the point of this tool to a user. These are the future challenges for SnooKnows.