Predict a Reddit user's vulnerability to harassment

Web Social Privacy Reflection


Privacy concerns on mainstream social networks have driven the popularity of anonymous networks like Reddit, Whisper and Secret. But many past incidents suggest that this blanket of protection may just be a puff of smoke. Because users tend to be more "candid" when anonymous, they still leave footprints that predators can follow. These traces help ill-willed individuals harass other users in these communities with whatever information they can dig up.
To increase the awareness of users on one such social network, Reddit, we conceived SnooKnows (named after Reddit’s extraterrestrial mascot). This is a simple web tool for Reddit users to quantify their vulnerability and preemptively build a defense against such harassing elements.

My Role

This project was part of one of my undergraduate courses, Privacy and Security in Online Social Media. It made up a heavy chunk of the course grade and took us the entire semester to build. The arc stretched from finding a real problem to building a solution. While team roles were initially fluid, we diverged into specialised roles once the problem became less nebulous.

User Research
I was arguably the most experienced Reddit user in the team, though eventually I had everyone hooked. This role began with secondary research to find an actual issue and extended to writing the scripts that processed our information on users to build the ground truth. Manually handpicking users, following up with them, and even designing the remote interactions came under this job. The objective of this role was to draw out the themes that point to the problem.
Design and Development
When it came to the actual tool, I was responsible for sketching out the layout we would use to present the information to our user. After that, my job evolved into front-end developer of the web tool. Scripting the chart visualisations and deciding what kind of information to push to the user were my sections to work on.
Note: These roles are not necessarily exclusive to me, nor are they exhaustive in describing everything I did in the project.

The Problem

By the late 2000s and early 2010s, the internet audience had been flooded with social networks and online communities, each trying to offer something unique. While Facebook is arguably the most popular network, many anonymous networks started springing up in the later years of this bubble. Such networks thrive on their ability to protect a user's privacy and identity. Take Reddit, the network we focused on in this project: it doesn't even need an email to sign up. Just think of a username and password and you're good to go.

Anonymity ~ Candidness
Many of these networks don't need photos, "about me" sections, or anything of that sort. Without such artifacts, one would assume that users' online personalities are separate from their offline, real ones. The reasoning is that since users are not obliged to present their offline selves (for verification, validation or otherwise), they are playing make-believe in these communities.
However, the truth is that for many users, these anonymous networks are where they can be themselves without fear of persecution. The defensive shield these networks provide through abstraction allows these supposedly elusive individuals to tell the truth without fear. And this candour often causes users to slip and give away valuable details.
As David Carr of The New York Times recently wrote, the anonymity of Reddit “has the odd effect of prompting users to be very intimate and remarkably candid.” Agreed. But it is not mere anonymity that makes Reddit work so well.
From: State of the Web: Reddit, the world’s best anonymous social network (Digital Trends)
Hateful Elements Exist
Quite a few Reddit loyalists would claim that the real spine of the community, and what maintains order within the chaos of the web, is the rules of Reddit, known within the community as "reddiquette".
Yet another nuance of Reddit is the localisation of these rules to individual subreddits (or discussion boards) and the strict adherence to them. An example of this is the popular r/nosleep, where nobody questions the legitimacy of the horror stories posted. This mentality has led to the creation of hate subreddits and other subreddits that clearly thrive on character assassination. Many of these have been banned, but new ones can be formed just as quickly. Given that users often turn to Reddit for help and expose their problems to a community, they put themselves at risk of negative attention from these fringe elements. This can lead to harassment of users.
A list of banned subreddits related to body-shaming of fat individuals
Redditors can be Tracked
Among community members, one of the most infamous incidents of a user being tracked from their virtual profile to their real one was that of u/violentacrez. This man is known as "The Biggest Troll on the Web" and moderated some extremely controversial subreddits.
Though this is a case of investigative journalists tracking down a nefarious element on an anonymous network, there are other low-key incidents too. Many subreddits are dedicated to users talking about real incidents or about what they are like in real life. These discussions attract harassment. One would expect Reddit to downvote such comments into oblivion and protect the interests of the person in question. But the truth is that users can still feel threatened via personal messages, which are hidden from the community. There have been incidents of people leaving Reddit due to this feeling of insecurity.
From: The dark side of Reddit's GoneWild (The Daily Dot)

Simply put, this endangers the experience of users on Reddit and can lead to them quitting the website altogether. This not only hurts Reddit's user retention but also leaves a hole in the space a site like Reddit could fill in people's lives.

The Solution

To tackle this problem, we developed a web tool that pulls all the public data of a given Reddit user and analyses it. Natural Language Processing and Machine Learning techniques are applied to extract sensitive information from this bulk of data. Ultimately, the user is presented with a classification of how sensitive their data is, based on the kind of features extracted.

Our Process

1) Getting the Users
We floated a survey within our undergraduate institute to collect Reddit usernames and obtain some ground truth. However, the response was understandably very low, as most users are concerned about their privacy. As a result, we manually handpicked a few hundred active Reddit handles from various popular subreddits. Only content generators and creators were selected, rather than simple spectators. The reason was that we planned to analyse the textual content posted by users as either fresh posts or comments. Since Reddit does not provide access to what users upvote, we could not reasonably learn anything about users from that information.
2) Obtaining Ground Truth
Once we had prepared a list, we planned to analyse the public information of these users and then follow up with them to validate our estimates. The initial analysis of the collected user profiles was done with an online Reddit analytics tool called SnoopSnoo. The results were then sent to the users as private messages, and they were asked to fill in a form to verify the accuracy and authenticity of the results. These verified results were then used as ground truth for SnooKnows.
Note: At this point users were asked to provide consent for us to continue using their publicly available data.
A sample of the direct message we used to follow up our ground truth with users
3) User Data Analysis
Now it was time to analyse the user data itself. We employed a third-party Natural Language Processing API, TextRazor, to extract various features from users' publicly available comments. The API extracts things like Entities and Relationships from the given text, which are then used to derive attributes such as the things a user likes to discuss, the user's relationships and the people in their family. The API also gives a broad list of topics for the text, ranked by a “contextual score”.
The results from our extraction were compared with the ground truth to gauge how accurate the tool we were trying to construct was.
The accuracy of our extractions based on the ground truth.
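As a rough illustration of this comparison, the sketch below computes a per-field accuracy by checking each extracted value against the value the user confirmed. The field names, the sample users and the case-insensitive exact-match rule are all assumptions made for the example, not the procedure the project actually used.

```python
from collections import defaultdict


def field_accuracy(extracted, ground_truth):
    """Return {field: fraction of users whose extracted value matches the confirmed one}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user, truth in ground_truth.items():
        guesses = extracted.get(user, {})
        for field, true_value in truth.items():
            totals[field] += 1
            guess = guesses.get(field)
            # A missing extraction counts as a miss; matching ignores case and padding.
            if guess is not None and guess.strip().lower() == true_value.strip().lower():
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


# Hypothetical data: values users confirmed vs. values the pipeline extracted.
ground_truth = {
    "user_a": {"location": "Delhi", "gender": "female"},
    "user_b": {"location": "Mumbai", "gender": "male"},
}
extracted = {
    "user_a": {"location": "delhi", "gender": "male"},
    "user_b": {"location": "Mumbai"},
}
accuracy = field_accuracy(extracted, ground_truth)
```

Counting a missing extraction as a miss keeps the figure conservative; a stricter evaluation might separate "not found" from "found but wrong".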
4) Data sensitivity analysis
Another survey was floated to analyse what kind of data, if gathered about anonymous users, could be sensitive and thereby potentially used against them for harassment. This survey was sent to the students in our institute and to the Reddit community for results. These were then combined at the end, for analysis, and used to come up with a vulnerability score for a Reddit user.
The most important information to present to users is their potential for being harassed on Reddit, based on the score they get. We also decided to provide users with links to the sensitive information they posted on Reddit so that they can track back and modify the content.
How vulnerable, on a scale of 1-10, an average user feels if a predator learns about one of the fields
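One simple way to combine the two survey pools is to average the 1-10 ratings per field. The sketch below does that. The response format, the field names and the choice of a plain mean are assumptions for illustration, since the write-up does not state how the responses were combined.

```python
from statistics import mean


def mean_sensitivity(responses):
    """Average the 1-10 vulnerability ratings per field across all respondents."""
    ratings = {}
    for response in responses:
        for field, rating in response.items():
            if not 1 <= rating <= 10:
                raise ValueError(f"rating for {field!r} is outside the 1-10 scale")
            ratings.setdefault(field, []).append(rating)
    return {field: mean(values) for field, values in ratings.items()}


# Hypothetical responses; each dict is one respondent's rating per field.
institute_responses = [{"location": 8, "family": 6}, {"location": 6, "family": 9}]
reddit_responses = [{"location": 7, "sexuality": 4}]

# Both pools are merged before averaging.
weights = mean_sensitivity(institute_responses + reddit_responses)
```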

Functional Prototype

Dynamic Data Fetch
Any Reddit user can use this tool to understand their vulnerability. Due to latency issues, it currently fetches only the top 100 comments by the user. However, this can be extended to larger batches as the processing stage becomes faster.
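A minimal sketch of such a fetch is shown below. It assumes Reddit's public JSON listing for a user's comments; the helper names and the User-Agent string are hypothetical, and unauthenticated requests may be rate limited or refused, so this is not necessarily how the prototype retrieved its data.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def comments_url(username, limit=100):
    """Build the listing URL for a user's top comments (assumed public endpoint)."""
    return f"https://www.reddit.com/user/{quote(username)}/comments.json?sort=top&limit={limit}"


def parse_comments(listing):
    """Pull body, subreddit and permalink out of a Reddit listing payload."""
    comments = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        if "body" in data:
            comments.append({
                "body": data["body"],
                "subreddit": data.get("subreddit"),
                "permalink": "https://www.reddit.com" + data.get("permalink", ""),
            })
    return comments


def fetch_top_comments(username, limit=100):
    """Download and parse the listing; needs network access."""
    request = Request(
        comments_url(username, limit),
        headers={"User-Agent": "snooknows-sketch/0.1"},  # hypothetical identifier
    )
    with urlopen(request, timeout=10) as response:
        return parse_comments(json.load(response))
```

Keeping the permalink alongside each comment is what later lets the tool link a user back to the exact comment where a detail was revealed.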
Vulnerability Score
The user is presented with a vulnerability score that was calculated based on the number of sensitive fields the tool could infer. A list of all these fields and the inferred value is also presented. Certain objective fields also have external links that a user can use to track back to the actual comment where that piece of information was revealed.
This score was determined based on a prior study in which respondents rated, on a scale of 1-10, how vulnerable they would feel if someone learned about one of the fields.
A radial meter with dynamic colour is used to draw attention to how vulnerable a user is
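The exact formula is not spelled out here, so the following is only one plausible way to turn the inferred fields into a score: sum the survey weights of the fields the tool managed to infer and scale that against the total of all weights. The weights and field names below are made-up values for illustration.

```python
def vulnerability_score(inferred_fields, weights):
    """Scale the summed sensitivity of the inferred fields to a 0-10 range."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    # Fields without a survey weight are ignored rather than guessed.
    exposed = sum(weights[field] for field in inferred_fields if field in weights)
    return round(10 * exposed / total, 1)


# Hypothetical per-field weights (mean 1-10 survey ratings).
weights = {"location": 8, "family": 6, "sexuality": 4, "age": 2}
score = vulnerability_score({"location", "family"}, weights)
```

With these sample weights, a user whose location and family details were inferred exposes 14 of a possible 20 points, which maps to 7.0 on the meter.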
Subreddit Distribution
Sources of all the inferences this tool makes are comments. On Reddit, every comment belongs to a post that belongs to a subreddit. These subreddits are essentially topics used for discussion.
A distribution showing where maximum information is revealed can urge the user to modify their activity on that subreddit. More importantly, they can be more cautious when interacting with members of that community.
A donut chart representing where maximum vulnerable information is revealed
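The distribution behind such a chart can be sketched as a simple count of sensitive comments per subreddit. The comment records and their `subreddit` key are assumptions about how the extracted data might be stored.

```python
from collections import Counter


def subreddit_distribution(sensitive_comments):
    """Return {subreddit: percentage of sensitive comments}, largest share first."""
    counts = Counter(comment["subreddit"] for comment in sensitive_comments)
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.most_common()}


# Hypothetical comments that each revealed at least one sensitive field.
sensitive_comments = [
    {"subreddit": "relationships", "field": "family"},
    {"subreddit": "relationships", "field": "location"},
    {"subreddit": "AskReddit", "field": "age"},
    {"subreddit": "india", "field": "location"},
]
distribution = subreddit_distribution(sensitive_comments)
```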
Most Vulnerable Combination
Once features have been extracted, the tool computes the possible combinations of features an attacker might learn. A prior study had users rate which sets of features, in the hands of a predator, would make them feel vulnerable.
We calculate the possible sets a predator can learn for this particular user, and then present the percentage of respondents associated with each such combination.
The best way to understand this is, “50% of people feel vulnerable only when their family members and location are known.”
Showing how vulnerable they might be based on multiple pieces of information someone can retrieve from their profile
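The combination step can be sketched by enumerating every subset of the inferred fields and looking each one up in the survey results. The `survey_share` table, its percentages and the field names are hypothetical stand-ins for the study data.

```python
from itertools import combinations


def exposed_combinations(inferred_fields, survey_share):
    """List the surveyed field combinations this user exposes, highest share first.

    survey_share maps a frozenset of fields to the percentage of respondents
    who said that combination would make them feel vulnerable.
    """
    fields = sorted(inferred_fields)
    results = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            share = survey_share.get(frozenset(combo))
            if share is not None:
                results.append((combo, share))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical survey results.
survey_share = {
    frozenset({"family", "location"}): 50,
    frozenset({"location"}): 20,
    frozenset({"age", "gender"}): 10,
}
ranked = exposed_combinations({"location", "family", "age"}, survey_share)
```

Enumerating all subsets grows exponentially with the number of fields, which is acceptable only because the tool infers a small, fixed set of fields per user.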
The User-Application flow of the tool


This tool is just a prototype and can do with a lot of improvements.

User Authentication
Naturally, only a given Reddit user should be allowed to check their own vulnerability. Without this restriction, the tool becomes a weapon for the very predators we are trying to protect users from.
Improvement of Vulnerability Score
The studies determined only the sensitivity of each field, not of its value. The overall results indicate that sexuality is not a very sensitive feature. However, this may be biased because the sample was composed of heterosexual individuals.
Attack Tool
Currently this is a defence tool but its usefulness in offensive measures needs to be explored. This might have a future in investigating and hunting down trolls.

Beyond all this, there needs to be much more user testing for the tool and interface itself. The information visualisation may not be the best means of communicating the point of this tool to a user. These are the future challenges for SnooKnows.