The GSoC Journey



About Content Holmes

The Content Holmes project was started in December 2016 to tackle the growing problems of cyberbullying, depression, and exposure to profane content among children. To this end, we built a content moderator that includes profanity replacement, website blocking, sentiment analysis, and depression detection. Content Holmes has one goal: to protect children when they use the internet. Parents no longer need to monitor them themselves; Content Holmes does all the monitoring, blocking, and filtering at the touch of a button.

We are largely a student-driven organization, with experts from psychology and computer science helping us keep Content Holmes up to date with the current needs of its users. Our aim is to automate content analysis and filtering using machine learning and artificial intelligence, while keeping the algorithms efficient enough to run on client machines. Currently, Content Holmes is available as a web extension on the Chrome Web Store, and all its settings can be tweaked remotely via an NLP chatbot accessible from Facebook. The team is also working on a native desktop client and an Android app so that local content can be scanned for profanity.



Introducing yourself

First, please introduce yourself in our Slack workspace. This will help us gauge your interest and background. If you have questions about Content Holmes, feel free to ask them in the #general channel; we are open to new ideas and suggestions.

You can start drafting your proposal in the #proposals channel, where we will help you edit it and provide feedback.



General Requirements

You should have a GitHub account and a Slack account on our workspace.
All generated code must be released on GitHub under an open-source license. This license may be either the GNU Affero General Public License or Apache-2.0.
There will be a brainstorming session each week during GSoC, after which contributing students will give a report detailing what was achieved in the past week. The contents of the report may vary according to each student's project.
Most of all, you should find an area that you enjoy. This will help you learn quickly and enable you to provide amazing input in your project and the brainstorming sessions.



Programming Languages

Mostly JavaScript and Node.js; in some places, Python may come in handy. Knowledge of data structures and algorithms, together with an understanding of either network programming and security or machine learning, is desirable.






The Bucket List



Unit testing module for extension content scripts

Difficulty: Easy

Requirements: Familiarity with Node.js and testing frameworks such as Mocha, Sinon, and Chai.

Mentor: Rajat

Co-Mentor: Deepak


A testing suite has already been integrated into the Content Holmes extension. However, this suite tests only limited functionality of the extension and is by no means complete. Here, applicants will design a modular testing suite that mocks web pages and runs the content scripts to check their results. The extension has many modules performing a variety of tasks and will thus need very comprehensive and complete test cases (which might also be the most challenging part of implementing this idea). The current module already uses Mocha, Sinon, and Chai to carry out testing. These modules can be found at Github:ContentHolmes/Content-Holmes. This idea can be implemented in conjunction with developing unit testing for the chatbot (next idea).


Deliverables:

Mid-Term: Files that effectively mock web pages (on Chrome or any other browser), database calls, and server requests.
Final: Full test cases implemented for all modules and content scripts in the extension.



Unit testing (and Continuous Integration) for chatbot

Difficulty: Easy

Requirements: Familiarity with Node.js and testing frameworks such as Mocha, Sinon, and Chai.

Mentor: Rajat

Co-Mentor: Deepak


When parents install Content Holmes on their children's PCs, they need an easy and intuitive way to regulate the application's behavior and receive reports. An NLP-enabled chatbot hosted on Facebook provides just this. However, there is currently no testing system built in for the bot, which means that developers either need to check every dialog manually to make sure that nothing is broken, or wait for the server to report exceptions whenever a broken dialog is reached. Both situations are highly undesirable. Here, applicants will design a modular unit-testing module that mocks connection requests and tests all dialogs and libraries. The project can also be extended to allow continuous integration with the server using Jenkins or Travis CI.


Deliverables:

Mid-Term: Testing module integrated into the bot. Tests should be written for about 30% of the libraries.
Final: Unit tests fully working on all the dialogs of the bot. The tester should follow the same project structure and naming conventions as the bot libraries, using github:microsoftly/BotTester.



News Crawler for content services

Difficulty: Medium

Requirements: Familiarity with Python, machine learning, and database management.

Mentor: Rajat

Co-Mentor: Rohit


Content Holmes has a built-in recommender system for gauging the interests of its users and redirecting them to a news feed matching those interests whenever a profane website is accessed. Currently, however, all the content for the user's interests is fetched using Google's Search API. The results from the API are generic and may not be related to what the user is seeking (e.g., searching 'python' may yield news about the snake rather than the programming language). Therefore, a news crawler will be used in conjunction with our website classifier to better customize this news feed. Here, applicants need to construct a crawler that can pick up news items from the web and classify them using the already existing website classifier. They will also need to design a scalable database management system to store information on these news articles.
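The crawl-then-classify step, reduced to an offline sketch on a stored page (the `classify` function here is a naive keyword stand-in for the existing website classifier, and the regex-based extraction is illustrative only; a real crawler would use a proper HTML parser):

```javascript
// Offline sketch: extract an article from fetched HTML, then classify it.
function extractArticle(html) {
  const m = html.match(/<title>(.*?)<\/title>/i);
  const title = m ? m[1] : '';
  const body = html.replace(/<[^>]+>/g, ' '); // crude tag stripping
  return { title, body };
}

// Stand-in for the existing website classifier.
function classify(article) {
  return /python|javascript|node/i.test(article.body) ? 'technology' : 'general';
}

const page =
  '<html><head><title>New Node.js release</title></head>' +
  '<body>The Node project shipped a new LTS.</body></html>';
const article = extractArticle(page);
console.log(article.title, '->', classify(article)); // → New Node.js release -> technology
```

The final deliverable would additionally tag and index each article into the database layer as it is fetched.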


Deliverables:

Mid-Term: A working crawler that fetches news articles; no tagging or indexing required.
Final: A completed web crawler such that articles are fetched, tagged, and indexed in real time.



Website classifier

Difficulty: Medium

Requirements: Familiarity with Python, machine learning, and text processing.

Mentor: Anirudh

Co-Mentor: Rohit


The website classifier is used to restrict access to adult sites. The classifier analyzes the contents of a web page (keywords, description, title, URL, images, etc.) and classifies it into one of the predefined categories. If the website is classified as adult, the user is prevented from accessing it. The classification should happen in real time, so the classifier must be optimized. The accuracy of the classifier should be high enough that non-adult sites are not misclassified.
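A deliberately naive sketch of the classification step, scoring page metadata against per-category keyword lists (the categories and word lists are made up for illustration; the real classifier would be a trained model, not keyword matching):

```javascript
// Naive keyword-scoring sketch of classifying a page from its metadata.
const CATEGORIES = {
  adult: ['xxx', 'porn', 'explicit'],
  news: ['breaking', 'report', 'headline'],
};

function classifyPage({ title, description, keywords }) {
  const text = `${title} ${description} ${keywords}`.toLowerCase();
  let best = 'other';
  let bestScore = 0;
  for (const [category, words] of Object.entries(CATEGORIES)) {
    const score = words.filter((w) => text.includes(w)).length;
    if (score > bestScore) {
      best = category;
      bestScore = score;
    }
  }
  return best; // the extension blocks the page when this is 'adult'
}

const verdict = classifyPage({
  title: 'Breaking report',
  description: 'headline news',
  keywords: '',
});
console.log(verdict); // → news
```

The real-time requirement is why the feature extraction above is limited to metadata already available at load time; heavier signals (images, full text) push latency up.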


Deliverables:

Mid-Term: An accurate website classifier (need not be optimized).
Final: An accurate website classifier optimized enough to classify in real time.



Image Classifier

Difficulty: Hard

Requirements: Familiarity with Python and deep learning.

Mentor: Rohit

Co-Mentor: Anirudh


Image classification is crucial for blocking adult content on sites. It also helps in website classification, so that adult sites can be blocked and learned automatically. Currently this is done using Microsoft APIs, but the task must be moved to an image classifier built with deep learning architectures. The classifier should be fast with good accuracy so that classification can be done in real time, and it should be able to classify thumbnails (low-resolution, small images) as well. A Bloom filter is to be built on top of the model to analyze images quickly, so that the main classifier runs only when the filter reports an image as possibly adult. An option to report inappropriate images must be added as well.


Deliverables:

Mid-Term: An accurate image classifier (need not be optimized).
Final: An accurate image classifier fast enough to classify images in real time, along with a reporting mechanism and Bloom filter. It must be integrated into the website classifier as well.



Interest Determiner & Recommender

Difficulty: Hard

Requirements: Familiarity with Python, machine learning, and deep learning.

Mentor: Rohit

Co-Mentor: Anirudh


Simply blocking adult content will make the child try to access it in other ways, ultimately leading to frustration. We solve this problem with a personalized content recommendation system that suggests content matching the child's interests. The child's interests are analyzed using his/her search history. The analyzer must be adaptive: if the user dismisses uninteresting content, the analyzer must adjust the interest profile accordingly. The model should be retrained on each user's machine so that it is tailored to his/her browsing patterns. A recommender must be built on top of this analyzer to recommend related and interesting topics, which are then crawled by the news crawler. The recommendations should be real-time, so the recommender must be optimized.
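The adaptive part can be illustrated with a simple weight map: reads increase a topic's weight, dismissals decay it, and the top-weighted topics feed the recommender. This sketch (class and method names are hypothetical) stands in for the actual learned model:

```javascript
// Sketch of an adaptive interest profile: reads reinforce, dismissals decay.
class InterestModel {
  constructor() {
    this.weights = new Map();
  }
  observeRead(topic) {
    this.weights.set(topic, (this.weights.get(topic) || 0) + 1);
  }
  observeDismiss(topic) {
    this.weights.set(topic, (this.weights.get(topic) || 0) * 0.5);
  }
  // Highest-weighted topics are handed to the news crawler.
  topInterests(n = 3) {
    return [...this.weights.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n)
      .map(([topic]) => topic);
  }
}

const model = new InterestModel();
['space', 'space', 'football', 'chess'].forEach((t) => model.observeRead(t));
model.observeDismiss('football'); // child dismissed a football article
console.log(model.topInterests(2)); // → [ 'space', 'chess' ]
```

Because the state is just a per-user weight map, it can live and be updated entirely on the client machine, matching the local-retraining requirement.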


Deliverables:

Mid-Term: Interest determination using searches and articles read.
Final: A recommender for suggesting topics, built on top of the interest determiner.



Sentiment Analyzer Improvements

Difficulty: Very Hard

Requirements: Familiarity with Python, JavaScript, machine learning, and neural networks.

Mentor: Mrigesh

Co-Mentor: Anirudh, Rohit


Content Holmes has a built-in sentiment analyzer for understanding the child's mental health. However, it is far from perfect. Currently, the sentiment analyzer works by matching words in the text against a stored set of words that carry particular weights. This project involves improving the sentiment analyzer so that it provides accurate and fast analysis of the content a child views, in real time; the analyzer must also be made aware of the context of the sentences. A learning model needs to be introduced, where the analyzer learns from the provided datasets and is ultimately able to run before the web page finishes loading. The model may be implemented in any language of the student's choice; however, for real-time calculation of sentiment scores, the results of the model need to be exported to JavaScript so they can run in web browsers.
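The current word-weight approach, reduced to a sketch: sum stored weights over the recognized tokens. The lexicon values here are invented for illustration; this is the baseline the project would improve on with a context-aware learned model:

```javascript
// Baseline word-weight sentiment scorer (illustrative lexicon).
const LEXICON = { happy: 2, great: 3, sad: -2, hopeless: -4, alone: -1 };

function sentimentScore(text) {
  return text
    .toLowerCase()
    .split(/\W+/) // tokenize on non-word characters
    .reduce((score, word) => score + (LEXICON[word] || 0), 0);
}

console.log(sentimentScore('I feel sad and alone')); // → -3
console.log(sentimentScore('What a great, happy day')); // → 5
```

The weakness this project targets is visible immediately: a purely lexical scorer gives "not sad" the same negative score as "sad", which is exactly what context awareness must fix.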


Deliverables:

Mid-Term: Come up with a model to calculate sentiment scores of text.
Final: Export the model to JavaScript and optimize it to run in real time.



Developing a desktop application which supports script injection

Difficulty: Very Hard

Requirements: Familiarity with Python, JavaScript, and Node.js.

Mentor: Mrigesh

Co-Mentor: Rajat


Currently, Content Holmes is a web extension. We would like to build a desktop application that provides all the functionality of the web extension. It will work by injecting scripts into the pages the child is viewing. By script injection, we mean appending a <script> tag to the <head> tag of the web page being viewed. The script should be injected every time a new page is loaded, a page is reloaded, or a new tab is opened, and the injection procedure should be fast enough to complete before the DOM has finished construction. Initially, the application should work with most popular web browsers; this can later be extended to all web browsers. The application should be able to choose which scripts to inject depending on various parameters (for example, the URL). Because the injected scripts need to communicate the collected data (depression scores, adult websites visited, etc.) back to the server, one needs to figure out a way to embed the user information into the injected scripts (this is quite easy!). Finally, the application must be able to run as a background service.
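The injection itself, including the embedded user information, can be sketched as a pure string transformation on the page HTML (the field names, URLs, and the global variable are hypothetical; the real application would perform this at the network or browser level, not on a string):

```javascript
// Sketch: append a <script> tag carrying user data to the page's <head>.
function buildInjection(html, scriptUrl, user) {
  // User info is baked into a global before the main script loads,
  // so the injected script can report collected data back to the server.
  const payload = JSON.stringify({ userId: user.id, reportUrl: user.reportUrl });
  const tag =
    `<script>window.__contentHolmes = ${payload};</script>` +
    `<script src="${scriptUrl}"></script>`;
  return html.replace(/<head>/i, `<head>${tag}`);
}

const page = '<html><head><title>t</title></head><body></body></html>';
const out = buildInjection(page, 'https://example.invalid/ch.js', {
  id: 'child-42',
  reportUrl: 'https://example.invalid/report',
});
console.log(out.includes('child-42')); // true
```

Choosing which `scriptUrl` to pass based on the page URL is where the per-parameter control described above would plug in.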


Deliverables:

Mid-Term: The application should be able to inject scripts into major browsers (Chrome, Firefox, Edge, Safari).
Final: Injected scripts should now carry user data for communication with the server, and the application should run as a background service. The application should be able to control which script is to be injected based on the URL. By this time, the application should be browser independent.