In many cases, the acquisition of well-labelled training data is a huge hurdle for developing accurate prediction systems with supervised learning.
At Love the Sales , we aggregate sales products from over 700 multinational retailers, which results in over 2 million products a day that need classification. It could take a traditional merchandising team four years to complete this task manually.
Our requirement was to apply this classification to the textual metadata of these 2 million products (mostly fashion and homewares) into 1000+ different categories - represented in a hierarchy, such as:
For our classification task, we decided upon Support Vector Machines (SVMs). SVMs are a class of supervised Machine Learning algorithm, proficient in the classification of linearly separable data.
Essentially, given a large enough set of labelled training data - an SVM will attempt to learn best discriminative plane between the examples - to try and draw a multidimensional line in the sand.
For example, here are some possible ways to separate this dataset:
The SVM will attempt to learn the optimal hyperplane:
Images from opencv.org
Whilst there are many Machine Learning algorithms for classification (Neural Networks, Random Forest, Naive Bayesian ), SVMs really shine for data with many features - in our case, document classification, wherein each ‘word’ is treated as a discrete feature.
Whilst SVMs can actually classify into multiple classes, we opted to use a hierarchy of simple two-class SVMs, chained together in a hierarchical fashion.
The main reason for this, is that when we tried it, it seemed to yield better results - and - importantly, it used a lot less memory on our machine learning platform because each SVM only had to know about two classes of data. High memory utilisation for large data sets (300k+ examples) and large input vectors (1m different known words) was a definite stumbling block for us.
Some simple well-known techniques for pre-processing our documents also helped a great deal in reducing the feature space, such as lowercasing, stemming , stripping odd characters and removing ‘noisy’ words and numbers.
Stemming is a common language-specific technique useful when dealing with large corpuses of textual data, with the goal of taking different words with a similar meaning and root, and ‘crunching’ them down to similar tokens. For example, the words ‘Clothing’ and ‘Clothes’ have a very similar meaning - when each are stemmed through the common ‘Porter’ stemming algorithm the result is ‘cloth’ - in doing this, we halved the number of words we have to worry about. Using stemming in conjunction with ‘Noisy word’ removal (the removal of common words without any domain meaning, e.g. The, Is, And, With… etc) we can hope to see large reduction in the number of words we have to deal with.
Once you have pre-processed your textual data, the next step is to train your model. In order to do this, you must first transform your textual data into a format the SVM can understand - this is known as ‘Vectorization’. A very quick simple description of this process may be as follows - take the sentence:
Men, you’ll look fantastic in this great pair of mens skinny jeans
After pre-processing it as described above with stemming and removing words we don’t care about:
men fantastic great pair men skinny jean
To take just the words from the above example, we can see we have one repeated word, so, we could encode the data like so:
This can be represented in a vector, such as:
This is fine for a small set of terms (only one short sample in my above case). However, as we add more samples and more terms our vocabulary increases, for example, if we add another training sample that isn’t mens skinny jeans:
women bootcut acid wash jean
We need to increase the overall vocabulary the algorithm must know about, e.g.
This means that our new term vector for the initial mens skinny jeans example has to change to:
When dealing with thousands of samples, your vocabulary can become large, and as a result can become quite cumbersome, as the encoded training samples become mostly empty, and very long:
[0,0,0,0,0,0,0,0,..... 2,0,0,0,0,0,.....1,0,0,0,0 …]
Thankfully, many machine learning libraries allow you to encode your term vectors as sparse vectors - which means you only need to supply non-zero cases, and the library (in our case LibSVM ) will magically figure it out for you, and fill in the gaps.
In such a case, you provide the term vectors, and additionally the classes they represent as ‘Term indices’ relative to the entire vocabulary for all the training samples you wish you use, e.g.
So, for “men fantastic great pair men skinny jean”, you would describe it as:
Term Index #2 : 1 Occurance Term Index #3 : 1 Occurance Term Index #4 : 1 Occurance Term Index #5 : 2 Occurrences Term Index #6 : 1 Occurrences Term Index #7 : 1 Occurrences
Which can then be encoded succinctly as:
Alexandre Kowalczyk has a great description of vocabulary preparation here , along with other great SVM tutorials.
A key learning for us, is that the way in which these SVM’s are structured can actually have a significant impact on how much training data has to be applied, for example, a naive approach would have been as follows:
[Click on the image to enlarge it]
This approach requires that for every additional sub-category, two new SVM’s be trained - for example, the addition of a new class for ‘Swimwear’ would require an additional SVM under Men's and Women's - not to mention the potential complexity of adding a ‘Unisex’ class at the top level. Overall, deep hierarchical structures can be too rigid to work with.
We were able to avoid a great deal of labelling & training work, by flattening our data structures into many sub-trees like so:
[Click on the image to enlarge it]
By decoupling our classification structure from the final hierarchy, it is possible to generate the final classification by traversing the SVM hierarchy with each document, and interrogating the results with simple set-based logic such as:
Mens Slim-fit jeans = (Mens and Jeans and Slim Fit) and not Womens
This approach vastly reduces the number of SVM’s required to classify documents, as the resultant sets can be intersected to represent the final classification.
It should also now be evident that adding new classes opens up an exponentially increasing number of final categories. For example - adding a top-level ‘Children's’ class - would immediately allow the creation of an entire dimension of new Children’s categories (children’s jeans, shirts, underwear, etc), with minimal additional training data (Only one additional SVM):
Because of the structure we chose, one key insight that we were able to leverage, was that of re-using training data, via linked data relationships. Linking data enabled us to re-use our training data by an overall factor of 9x - thus massively reducing the cost and increasing the accuracy of predictions.
For each individual class, we obviously want as many training-data examples as possible, covering both possible outcomes. Even though we built some excellent internal tooling, primarily a nice fast user interface for searching, sorting and labelling training data examples in large batches - labelling thousands of examples of each kind of product can still be laborious, costly and error-prone. We determined the best way to circumvent these issues was to attempt to reuse as much training data as we could, across classes.
For example, given some basic domain knowledge of the categories - we know for certain that ‘Washing machines’ can never be ‘Carpet cleaners’
By adding the ability to link ‘Exclude data’, we can heavily bolster the amount ‘Negative’ training examples for the ‘Washing machines’ SVM by adding to it the ‘Positive’ training data from ‘Carpet cleaners’ SVM. Put more simply, given that we know “Carpet cleaners can never be washing machines” - we may as well reuse that training data.
This approach has a nice uptick, in that whenever the need arises to add some additional training data to improve the ‘Carpet Cleaners’ SVM - it inadvertently improves the ‘Washing machines’ class, via linked negative data.
Finally, another chance for reuse, that is apparent when considering a hierarchy, is that the positive training data for any child nodes, is also always positive training data for its parent.
For example: ‘Jeans’ are always ‘Clothing’.
This means that for every positive example of training data added to the ‘Jeans’ SVM - an additional positive example is also added to the ‘Clothing’ SVM via the link.
Adding linked data is far more efficient than manually labelling 1000’s of examples.
We feel that Support Vector Machines have helped us to reach a quality and speed of classification that we could have never achieved with a non-machine learning approach. As such, we have come to learn that Support Vector Machines are an excellent addition to any developers toolbox, and that any investigation should also serve as a nice introduction to some key machine learning concepts.
Additionally, when it comes to the specifics of hierarchical classification systems, decoupling the classification component from the resulting hierarchy, flattening the data structure and enabling the reuse of training data will all be beneficial in gaining as much efficiency as possible. The approaches outlined above have not only helped reduce the amount of training data we needed to label, it has also given us greater flexibility overall.
Originally studying Computer Science in New Zealand, now London based, David Bishop lead the technical team at top 100 UK employment website reed.co.uk for 10 years, and has now gone on to found his own retail technology business LoveTheSales.com - which seeks to aggregate all of the sales from 1000’s of retail websites.