This file explains a machine learning project dedicated to the enhancement of thefiletree.com.

Feel free to add comments here, so I can take your ideas into account and make this project better!

Why this project?
-----------------

I like using the filetree. I like having my files public so that I can give my friends the URL, or have random users just passing by take a look, work with me, modify the files for their own usage, enhance them, have fun, etc. (the filetree story).

But I want to be able to rely on the fact that my files won't be lost because they were erased, vandalized, or hit by that kind of uncool user behavior. I want to be sure that I won't lose work I spent time on, without having to make local copies of the files on my computer "just in case".

Project description
-------------------

The starting point is to work on the log files of thefiletree.com. I have 43 MB of log files, which represent around 1,000,000 lines. As a reminder, thefiletree.com uses operational transformations (OTs, see http://en.wikipedia.org/wiki/Operational_transformation). Among those log lines, more than 100,000 are OTs.

Unfortunately, the quality of those logs is poor with regard to what I want to learn from them. So the first part of the project will be to build usable data from those logs, by trying to build aggregations.

The second step will be to label the data as "regular behavior" or "bad behavior". Jan Keromnes offered me some help if needed. I may use active learning techniques, but that will depend on the time it takes to label the data: if it doesn't take too long, I would prefer to label all the data with the help of other users.

Finally, I will try some of the machine learning algorithms we are learning in my current course on the data, to get some results. If the results are convincing, one of the two project leaders has agreed to make all the enhancements needed to the log creation, so that the logs contain much more useful information for detecting bad behavior.

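As an illustration of the aggregation step, here is a minimal sketch in Python. The real log format is not shown in this file, so the `LogEvent` fields, the `aggregate` function and the 60-second window are all assumptions made up for the example, not the actual structure of the thefiletree.com logs.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout: the real log format is not described in
# this file, so every field below is an assumption.
@dataclass
class LogEvent:
    timestamp: float  # seconds since some reference point
    user: str         # session or connection identifier
    path: str         # file touched by the operation
    inserted: int     # characters inserted by the OT
    deleted: int      # characters deleted by the OT


def aggregate(events, window_seconds=60):
    """Group events by (user, path, time window) and sum simple counters."""
    buckets = defaultdict(lambda: {"ops": 0, "inserted": 0, "deleted": 0})
    for e in events:
        window = int(e.timestamp // window_seconds)
        b = buckets[(e.user, e.path, window)]
        b["ops"] += 1
        b["inserted"] += e.inserted
        b["deleted"] += e.deleted
    return dict(buckets)


# Made-up events, only to show the shape of the output.
events = [
    LogEvent(0.0, "u1", "/a.txt", 5, 0),
    LogEvent(10.0, "u1", "/a.txt", 0, 200),
    LogEvent(70.0, "u1", "/a.txt", 3, 0),
    LogEvent(15.0, "u2", "/b.txt", 12, 1),
]
features = aggregate(events)
```

Each key of `features` would then be one row to label as "regular behavior" or "bad behavior". A window that deletes far more characters than it inserts is the kind of signal such counters might expose, but whether it is actually useful remains to be checked on the real logs.
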
Fallback project
----------------

As I'm not sure I will get usable results from the logs, an easier project could also be useful: finding dead parts of the tree, i.e. files and folders that should be deleted. Users can't delete files, for obvious reasons, and the administrators don't have time to spend on deleting the useless files and folders that are polluting thefiletree.com.

To work on that, the file names, metadata, plugs and contents provide good-quality inputs for machine learning.
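
To give an idea of how those inputs could be turned into features, here is a small Python sketch. The `file_features` function, its parameters and the chosen features are hypothetical: how thefiletree.com actually stores metadata and plugs is not described in this file, so plugs are left out and the dates are made up.

```python
from datetime import datetime, timezone


def file_features(name, content, last_modified, now):
    """Turn one file into a flat feature dict (all features are assumptions)."""
    age_days = (now - last_modified).total_seconds() / 86400
    return {
        "name_length": len(name),
        "has_extension": "." in name,
        "size_chars": len(content),
        "is_empty": len(content.strip()) == 0,
        "age_days": age_days,
    }


# Made-up example: an empty, extension-less file untouched for 30 days.
now = datetime(2012, 1, 31, tzinfo=timezone.utc)
feats = file_features(
    "untitled", "", datetime(2012, 1, 1, tzinfo=timezone.utc), now
)
```

A dictionary like `feats` could be computed for every file and folder, then labeled "dead" or "alive" and fed to a classifier.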