Previously, on “Machine Learning for everybody”, we talked about what machine learning is and why it’s nice to have. In this episode I’ll ramble on about how a computer can learn things, or rather how you may teach it.
Do androids learn from ebooks?
Before getting technical, it may be a good idea to take a look back at how we learn. How do humans even learn? I don’t think I can speak for all of the human race, so I will talk about myself and hope I’m somewhat relatable. I may read a book, watch a video, go to class, talk to someone more knowledgeable, or accumulate examples of something and find patterns in them. We could think of those as forms of input that I can process in order to learn something.
That sounds more like a computery way of doing things, so let’s go with that. However, most of those methods don’t work for computers, since they don’t actually understand our language. Luckily, we can both find patterns, and that’s probably enough. It doesn’t matter if a computer doesn’t know what a face is, as long as it can find something in an image that’s similar to something we already marked as a face.
So, for them to learn we need to teach them. For us to teach something to a machine we’ll have to give it data, lots of it, or more precisely, a dataset. This is perhaps the key component of this whole thing, so I want to make sure you get the gist of it.
Consider the data gathered from a national census, in which for each individual you would have things like age, income, level of education, marital status, occupation, etc.
You would get one entry for each person, with different values for those variables to represent the person. Each of these instances is a piece of data. The variables (age, income, etc.) are what you’d call features of your data. So, indeed, once more in the history of computer science, this can be solved with a spreadsheet. A spreadsheet in which the columns are the features and each row represents what those values are for a particular case.
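To make the spreadsheet picture concrete, here is a minimal sketch of a dataset in Python. The three people and all their values are invented for illustration, and `column` is just a hypothetical helper name: each dict is a row (one person) and each key is a feature (one column).

```python
# A tiny, made-up census-style dataset.
# Each dict is one row (a person); each key is one feature (a column).
dataset = [
    {"age": 34, "income": 42000, "education": "university", "marital_status": "married", "occupation": "teacher"},
    {"age": 51, "income": 58000, "education": "high school", "marital_status": "divorced", "occupation": "plumber"},
    {"age": 27, "income": 31000, "education": "university", "marital_status": "single", "occupation": "nurse"},
]

# The features are simply the column names.
features = list(dataset[0].keys())


def column(rows, feature):
    """Return every row's value for one feature, i.e. one spreadsheet column."""
    return [row[feature] for row in rows]


print(features)
print(column(dataset, "age"))
```

Real datasets have many more rows and usually live in a file or a database, but the shape is the same.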
Okay, great, we have data! Now we can go somewhere. There are many ways in which the computer may use this data. I’m just going to briefly introduce the popular ones and expand on them later, and also mention some others for the sake of completeness.
Supervised learning is the most used one. Here you not only give the computer your dataset, you also mark the correct answer for each piece of your data.
For example, we have 3GB of photos of both faces and things that are not faces, and I want to train the computer to tell me if a photo has a human face or not. Then, I’m going to give it each photo telling it the answer for each one. With some set of algorithms the computer will hopefully find what to look for in a photo to determine if there is a face or not. Then, I’ll present it with images it’s never seen before and quiz it to see if it learned well. If it did, then great, otherwise we may have to see what went wrong.
The crucial element here is adding some tags (labels) to your dataset. Going back to thinking of a dataset as a table, you just add extra columns containing the answer you expect from each row. Mind you, it could be more than one column, depending on what you want.
Something cool about this method is that since you know the answers, you are able to know how well it learned. You put on a classic ’80s song and do a training montage of your preferred algorithm, and end up with a model that can answer stuff. Then you can test it out and see how many it answers correctly!
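The "quiz it on things it has never seen" idea can be sketched like this. Everything here is an assumption made for illustration: the data is invented, a single number stands in for whatever we would measure in a photo, `toy_model` is a hand-written rule rather than a trained model, and the 80/20 split is just a common convention.

```python
def accuracy(model, examples):
    """Fraction of (input, expected_answer) pairs the model answers correctly."""
    correct = sum(1 for x, expected in examples if model(x) == expected)
    return correct / len(examples)


# Hypothetical labeled data: (feature value, answer) pairs.
labeled = [
    (0.9, "face"), (0.8, "face"), (0.2, "not face"), (0.1, "not face"),
    (0.7, "face"), (0.3, "not face"), (0.6, "face"), (0.4, "not face"),
    (0.95, "face"), (0.45, "face"),
]

# Keep the last 20% aside as the quiz; training would only ever see `train`.
split = len(labeled) * 8 // 10
train, test = labeled[:split], labeled[split:]


def toy_model(x):
    # Stand-in for a trained model: a threshold picked by hand.
    return "face" if x >= 0.5 else "not face"


print("accuracy on training data:", accuracy(toy_model, train))
print("accuracy on unseen data:", accuracy(toy_model, test))
```

The number that matters is the second one: doing well on data the model already saw says little about how it handles new photos.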
The way the “learning” is done varies from algorithm to algorithm, but one possibility would be the following:
- Memorize the whole dataset, in this case all the labeled pictures, for which you already know whether they are faces or not
- Whenever you are presented with a new photo, find the 5 images you know that look the most similar to the new one
- If the majority of the known similar images are of faces, then it’s a face, otherwise it’s not.
Not a particularly great or efficient method, but it’s a decently effective one. This is a simplification of an algorithm named k-Nearest Neighbors (in my example above, k=5), which is one of the most basic ones. We’d have to establish a criterion for when two images are similar and we’re set.
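The three steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: I am assuming each image has already been boiled down to a short list of numbers (the values below are made up), and I am using Euclidean distance as the similarity criterion, which is one common choice among many.

```python
from collections import Counter
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(known, new_item, k=5):
    """Label new_item by majority vote among its k closest known examples.

    known is the memorized dataset: a list of (features, label) pairs.
    """
    nearest = sorted(known, key=lambda pair: distance(pair[0], new_item))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up "images", each reduced to two hypothetical numeric features.
known = [
    ((0.9, 0.8), "face"), ((0.8, 0.9), "face"),
    ((0.85, 0.7), "face"), ((0.7, 0.95), "face"),
    ((0.1, 0.2), "not face"), ((0.2, 0.1), "not face"),
    ((0.15, 0.3), "not face"), ((0.3, 0.25), "not face"),
]

print(knn_predict(known, (0.8, 0.8)))
print(knn_predict(known, (0.2, 0.2)))
```

Note that there is no real training step: all the work happens when a new item arrives, which is why the method gets slow as the memorized dataset grows.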
A more sensible algorithm, instead of memorizing the initial data, may extract some patterns from the dataset features and use those instead. For example:
- Identify the most dividing features of the data. In the faces example, images that have two white circles close to each other with smaller dark circles inside (eyes) have more chance of being images with faces than those without them. Of course, there could be images with that characteristic that don’t contain faces, so after the eyes we’d check for the presence of a nose, or a mouth (or stuff that looks like those).
- When presented with a new image we first look for eyes. If it doesn’t have any, then we decide it’s not a face. If it has eyes, we continue evaluating.
- Now we check for a nose. Then for lips. Then for eyebrows. You can go as deep as you want to: perhaps just the eyes and the nose are good enough for you, maybe you need more, that’s up to you.
This is perhaps a little more clever than the method above. Just like before, this is a simplification of a real algorithm, in this case a Decision Tree. The success of this algorithm depends heavily on the quality of the features of your data.
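The eyes-then-nose-then-lips chain could look something like the sketch below. The feature names (`has_eyes` and friends) are hypothetical: I am assuming some earlier step already detected them in the image. Also, a real Decision Tree algorithm learns which features to check, and in which order, from the labeled data; here the order is written by hand just to show what using the tree looks like.

```python
# Order in which the hand-written "tree" checks its (hypothetical) features.
CHECKS = ["has_eyes", "has_nose", "has_lips", "has_eyebrows"]


def looks_like_face(image_features, depth=3):
    """Walk the checks in order; bail out at the first missing feature.

    image_features: dict of booleans produced by some assumed detector.
    depth: how many checks to run before we are satisfied.
    """
    for feature in CHECKS[:depth]:
        if not image_features.get(feature, False):
            return False  # e.g. no eyes, so we decide it's not a face
    return True


photo = {"has_eyes": True, "has_nose": True, "has_lips": False}

print(looks_like_face(photo, depth=2))  # only eyes and nose are checked
print(looks_like_face(photo, depth=3))  # lips are checked too
```

The `depth` parameter is the "go as deep as you want" knob: a shallow tree is more permissive, a deeper one is pickier.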
The last thing I want to talk about is how these algorithms could go wrong, and there are two reasons: bad methodology and a bad dataset. Methodology is a subject for next time, but the bad dataset we can cover now. There are a few components to it. The first is the usefulness of the features. If you want to identify whether an image contains a face by checking the amount of each color that appears in the image, you’ll have a bad time doing it reliably.
The second thing to look out for is having good variety in your dataset. If 95% of the images I use as input are of faces, the computer won’t learn enough about the ‘not face’ case. Imagine if instead of trying to find out “has face yes/no” you tried to predict between 10 possible answers: having one of those very underrepresented would really hurt the learning. The same goes if one was very overrepresented, since the results would get biased toward it. It’s very important to have an unbiased dataset.
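A quick way to spot this problem is to count how often each answer shows up before training anything. Below is a small sketch using the 95% example from above; the labels are generated on the spot and `label_shares` and `always_face` are names I made up for illustration.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of the dataset that each label takes up."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# The lopsided dataset from the text: 95% faces, 5% everything else.
labels = ["face"] * 95 + ["not face"] * 5
shares = label_shares(labels)
print(shares)


def always_face(_image):
    # A "model" that learned nothing and just answers "face" every time.
    return "face"


lazy_accuracy = sum(1 for label in labels if always_face(None) == label) / len(labels)
print(lazy_accuracy)
```

The lazy model scores 95% on this dataset without looking at a single image, which is exactly why a lopsided dataset makes results look better than they are.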
There are many other more complex algorithms commonly used in supervised learning that I won't try to explain, but I will mention some because they have cool names and if this post turns out to be unhelpful I want you to at least have a name for your band: Random Forests, Support Vector Machines, Naive Bayes, Neural Networks (not really an algorithm, more of a framework for learning, but serves a similar purpose, it's really powerful but requires a whole lot of whole lots of data to shine), etc. Look them up, it’s pretty cool stuff.
Before, we knew what answers we expected from the machine. In the example, we wanted to know if a photo contains a face or not. Now imagine we didn’t know what to look for. We just have the data, so we give it to the computer unlabeled. This is what’s called unsupervised learning. The idea is that the computer will come up with its own conclusions about the data. Ideally it will get something out of the input that allows it to separate the data into clusters that could then be analyzed more easily. You could find out which features were relevant in the making of the clusters, which features didn't make any difference, detect patterns that were hard to see before, make anomalies stand out, etc.
While this isn't as useful for solving problems as supervised learning, it provides a useful exploratory analysis of the data and helps determine which features are actually relevant. That lets you drop the features that didn't influence the clustering at all, giving you a simpler representation of the same data, which you could then use for supervised training in the future.
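To give a feel for how a computer might form clusters on its own, here is a minimal sketch of k-means, one common clustering algorithm (the post doesn't name a specific one, so this is my pick for illustration). The points are made up, and I am assuming we tell it up front how many clusters to look for by handing it starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2D points, starting from the given centroids.

    Returns (clusters, centroids), where clusters[i] holds the points
    assigned to centroids[i] in the last round.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Step 1: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 2: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Unlabeled, made-up data: two blobs that nobody told the computer about.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]

clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

No answers were given anywhere: the grouping comes purely from which points sit close together.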
A simple use case for this is a recommendation system. You could get people separated into clusters by their most bought products, watched series, musical preferences, etc. So people that have all bought a dishwasher, or seen Seinfeld, or listened to Bach would be considered similar by that criterion and grouped together, and you can easily get that “People that liked this also enjoyed…” based on what other people in the same cluster liked but you never considered.
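Once people are grouped, the "also enjoyed" part is mostly set arithmetic. Here is a small sketch, assuming the clustering already happened: the users, their purchases and the `cluster_of` assignment are all invented for illustration.

```python
def recommend(user, purchases, cluster_of):
    """Suggest items that cluster-mates of `user` have but `user` doesn't."""
    own = purchases[user]
    suggestions = set()
    for other, items in purchases.items():
        if other != user and cluster_of[other] == cluster_of[user]:
            suggestions |= items - own
    return sorted(suggestions)


# Hypothetical users and what they bought, watched or listened to.
purchases = {
    "ana": {"dishwasher", "Seinfeld"},
    "ben": {"dishwasher", "Seinfeld", "Bach"},
    "cat": {"Bach", "sneakers"},
}

# Hypothetical output of a clustering step: ana and ben ended up together.
cluster_of = {"ana": 0, "ben": 0, "cat": 1}

print(recommend("ana", purchases, cluster_of))
print(recommend("cat", purchases, cluster_of))
```

Ana gets Bach suggested because Ben, who sits in her cluster, liked it; Cat gets nothing because she is alone in hers.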
There are many other techniques that can be used for machine teaching. Depending on the problem at hand and the available data, you may want to use one or the other, or even more than one. To mention a few, there’s Reinforcement Learning, Semi-supervised Learning, Multi-task Learning, Transfer Learning, Domain Adaptation, Structured Prediction, One-shot Learning, etc. Just know they exist, and look them up if they piqued your curiosity.
It goes without saying there’s plenty to be said about the techniques and algorithms I mentioned here, but that would be never-ending, so I’m gonna call it quits for now and hope I presented the basics in an understandable way and maybe motivated you to learn more about some of it.
Next time I’ll go over the tools and methodology you could use to go about machine teaching. Stay tuned. See ya!