Blog Archive

10 September 2015

Quotes on uncertain questions: Le Stat and le statistical science

In consideration of Georg Cantor's view on the importance of asking the right questions: 

“Very few beings really seek knowledge in this world. Mortal or immortal, few really ask. On the contrary, they try to wring from the unknown the answers they have already shaped in their own minds -- justifications, confirmations, forms of consolation without which they can't go on. To really ask is to open the door to the whirlwind. The answer may annihilate the question and the questioner.” 
― Anne Rice, The Vampire Lestat

but,

“Il n'est pas certain que tout soit incertain.
(Translation: It is not certain that everything is uncertain.)” 
― Blaise Pascal, Pascal's Pensées

And so it is with quantum computing, a theoretical framework standing squarely on the shoulders of the pioneers of quantum mechanics and statistical mechanics, and one which opens doors to tackling ever more computationally complex problems. Mathematical abstraction is truly effective for finding the right questions to ask. 
- DLW

30 August 2015

Mindscapes

On a visit to the Mathematical Sciences Research Institute
by Diane Wilcox, Aug 2015





The MSRI is an excellent environment for mathematical scientists in every respect and so it is a privilege to visit. The workshops over the past two weeks have been centred around two big areas of mathematics, with overlapping communities, and cover recent developments of common interest in a thematic programme on Deterministic Dynamics and Randomness in High and Infinite Dimensional Systems. The main motivating phenomena under consideration stem from mathematical physics, with applications in almost all technological domains. The research fields are therefore mature, and buy-in for novelty with respect to underlying concepts is high. Nevertheless, there is no shortage of open mathematical questions at the cutting edge.



The workshop in the first week provided a platform for women, and the confident, enthusiastic and lucid presentations by young researchers were awesome. While the numbers of women in the mathematical and natural sciences are still disproportionately low, in the US and across the globe, there is clearly progress in the right direction, and the workshop continues to prove that women are capable of disciplined research and deep insights. Challenging social and intellectual norms is non-trivial and social anthropology continues to address issues of cultural development - the topic of women researchers in so-called `genius fields' is slowly becoming less obfuscated.





For my own part, the visit is a refreshing short sabbatical break from the routines of teaching, project supervision, email and admin. There are numerous overlaps between the research domains of the programme and the different areas of mathematics where I have worked. In some respects there are more ideas to cogitate on, while in others, it is great to sharpen connections and clarify insights. The shift back to starting from abstract axiomatic perspectives, rather than considering the domain-specific relevance of mathematical models, definitely keeps the fun in functional analysis. At the same time, domain-specific insight and breadth of knowledge helps with asking key questions.




I am staying in a room rented via Airbnb, a cost-effective service that is being supported by academic communities. The house is quiet but quirky, and the chickens are allowed to roam near my basement room window in the early morning.




When I was about 13, I thought bridge design was amazing (it was the early '80s and an academic career in the mathematical sciences happened to be completely off the radar!). I enjoyed reading a library book on engineering with the Golden Gate Bridge on the cover - I look forward to visiting this icon when I get to fit in some sight-seeing. So far the bridge has mostly been shrouded in fog across the bay from the hillside view of the institute, and it finally became more visible on some sunnier days last week.

 
Acknowledgements: I thank the MSRI and workshop organisers for facilitating the thematic programme and thank my co-authors, colleagues and students for more than a decade of thoughtful discussions and debates. This trip has been funded by the NRF of SA (the particular grant is based on my research proposal of 2013). I also thank the School of CSAM at Wits University and members of the AMF teaching team for facilitating a short sabbatical break. I would not be visiting the MSRI at this particular time without the support of an amazing, multi-tasking co-parent - I thank him and my family for allowing me time to reflect, read, write and do some advanced arithmetic. 

15 August 2015

Favourite math quotes

Joseph Fourier: "Mathematical analysis is as extensive as nature itself; it defines all perceptible relations, measures times, spaces, forces, temperatures; this difficult science is formed slowly, but it preserves every principle which it has once acquired; it grows and strengthens itself incessantly in the midst of the many variations and errors of the human mind. Its chief attribute is clearness; it has no marks to express confused notions. It brings together phenomena the most diverse, and discovers the hidden analogies which unite them."

Georg Cantor: The art of asking the right questions in mathematics is more important than the art of solving them.

Remark on Cantor's observation: asking key questions focusses one's attention, but to ask the right questions, you need to know quite a bit about existing problems and how they have been approached.

22 July 2015

Two great losses



Below is a link to the blog post by Alain Connes on the passing of Uffe Haagerup. I had the good fortune of meeting Prof Haagerup in 2002. His paper, Random matrices, free probability and the invariant subspace problem relative to a von Neumann algebra, served as encouragement for me to study random matrices and their application to denoising correlation matrix estimators for financial markets (and complex systems more generally) - it was a pivot for a switch from theory to applications of functional analysis at that time.
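
For readers curious what such denoising looks like in practice, here is a minimal Python sketch (with simulated data) of one standard random-matrix recipe: clipping the eigenvalues of a sample correlation matrix against the Marchenko-Pastur edge. It is only an illustration of the general idea, not the specific estimators studied in Haagerup's paper or in my own work.

import numpy as np

# Simulated returns: T observations of N assets, pure noise under the null
rng = np.random.default_rng(0)
T, N = 1000, 50
returns = rng.standard_normal((T, N))

C = np.corrcoef(returns, rowvar=False)      # sample correlation matrix
eigval, eigvec = np.linalg.eigh(C)

q = N / T
lambda_plus = (1 + np.sqrt(q)) ** 2         # upper Marchenko-Pastur edge

# Keep eigenvalues above the noise edge; replace the rest by their average
# so that the trace is (approximately) preserved
noise = eigval <= lambda_plus
cleaned = eigval.copy()
if noise.any():
    cleaned[noise] = eigval[noise].mean()

C_clean = eigvec @ np.diag(cleaned) @ eigvec.T
np.fill_diagonal(C_clean, 1.0)              # re-impose exact unit diagonal

print("eigenvalues kept above the MP edge:", int((~noise).sum()))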


  








Thanks for the generous and gentle way in which you communicated with novices.

--
http://noncommutativegeometry.blogspot.com/2015/07/two-great-losses.html: Two great losses
It is with incommensurable sadness that we learned of the death of two great figures of the fields of operator algebras and mathematical physics.
Daniel Kastler died on July 4-th in his house in Bandol.
Uffe Haagerup died on July 5-th in a tragic accident while swimming near his summer house in Denmark.

---
Here is Prof Connes' blog on Haagerup: http://noncommutativegeometry.blogspot.com/2015/07/uffe-haagerup.html

Uffe Haagerup was a wonderful man, with a perfect kindness and openness of mind, and a mathematician of incredible power and insight. 
His whole career is a succession of amazing achievements and of decisive and extremely influential contributions to the field of operator algebras, C*-algebras and von Neumann algebras. 
His first work (1973-80) concerned the theory of weights and more generally the modular theory of Tomita-Takesaki. Uffe  Haagerup began by solving a key open question, showing that semi-finite weights are indeed supremum of families of normal states. This allowed him to develop the standard form of von Neumann algebras, a basic result used over and over since then.  I remember vividly his first appearance in the field of operator algebras and the striking elegance, clarity and strength of his contributions.
He was the first to introduce the Lp-spaces associated to a type III von Neumann algebra. These new spaces have since then received a lot of attention both from operator algebraists and specialists of Banach spaces and operator spaces.
In a remarkable paper published in Inventiones at the end of the seventies, Haagerup was able to analyse the operator norm in the C*-algebra of the free group, to control it by suitable Sobolev norms in spite of the exponential growth of the group and to prove in particular that in spite of its lack of nuclearity the reduced C*-algebra has the Banach space approximation property. This turned out in the long run to be a breakthrough of major importance. It was extended by Haagerup and his collaborators to any discrete subgroup of a simple real rank one Lie-group. One corollary is the property RD of rapid decay which allows one to define the analogue of the Harish-Chandra algebra of smooth elements in the general context of Gromov hyperbolic groups.  This highly non-trivial result of Haagerup turned out to be the technical key in the proof of the Novikov conjecture for such groups. Moreover the study of discrete groups with the Haagerup property continues to play a major role in geometric group theory.  The associated approximation property for the corresponding factors of type II_1 (called the Haagerup property) also played a major role in the solution by Popa of the long-standing problem of factors N non-isomorphic to M_2(N), and of exhibiting a factor with trivial fundamental group. 
The next fundamental contribution of Haagerup is to the classification of injective factors (1983-87). In my work on the subject I had left one case completely open (in 76) and the classification was thus incomplete. After several years of extremely hard work Haagerup was able to settle this question by proving that there is up to isomorphism only one hyperfinite factor of type III_1. This is a wonderful achievement.
Then Haagerup turned to the theory of subfactors, and, once again, was able to make fundamental contributions such as his construction of new irreducible subfactors of small index > 4, and outclass the best specialists of the subject created by Vaughan Jones at the beginning of the eighties. 
Another key contribution of Haagerup is to Voiculescu's Free Probability and to random matrices which he was able to apply very successfully to the theory of C*-algebras and of factors of type II1. For instance he proved that, in all II1 factors fulfilling the approximate embedding property (which is true in all known cases) every operator T with a non-trivial Brown measure has a non-trivial closed invariant subspace affiliated with the von Neumann algebra generated by T. 
Uffe continued at the same relentless pace to produce amazing contributions and the few mentioned above only give a glimpse of his most impressive collection of breakthrough achievements.
Uffe Haagerup was a marvelous mathematician, well-known to operator algebraists, and to the general community of analysts.  From my own perspective an analyst is characterized by the ability of having ”direct access to the infinite” and Uffe Haagerup possessed that quality to perfection. His disparition is a great loss for all of us. 

16 July 2015

Books to read

http://theadvisorcambodia.com/2015/06/all-for-nought/

All for nought

Posted On Jun 20, 2015
The concept of ‘zero’ is perhaps the most paradoxical theory of mankind. At once nothing and everything, it is the basis of the numerical system incorporated into most Westernised societies.  Amir D. Aczel investigates its origins in his latest book, Finding Zero.
The concept of zero, nought, or 0, is considered to be one of the highest intellectual achievements of mankind, and almost certainly the greatest conceptual leap in the history of mathematics.
As science writer and mathematician Amir Aczel puts it, “zero is not only a concept of nothingness, which allows us to do arithmetic well, and to algebraically define negative numbers… zero enables our base-10 number system to work, so that the same 10 numerals can be used over and over again, at different positions in a number.”
Finding Zero describes Aczel’s lifelong quest to discover from whence our number system came.
In the West, the Roman numeral system was used up until around the 13th century, yet didn’t have a zero, and it was incredibly cumbersome. Before that, the Sumerians and Babylonians counted in base 60 (which is still in use today for telling time and measuring angle), but also lacked a zero. The Mayan civilization of Central America used a glyph for a zero in some of their more complicated calendars, but its use varied, and it never made it out of Central America, so it can’t be related to “our” zero.
Aczel’s quest for the first zero takes him around the globe. Along the way, he encounters Indiana Jones-esque artefacts with splendid names, like the Ishango Bone, the Aztec Stone of the Sun, the Bakhshali Manuscript and the Nana Ghat Inscriptions. He travels to Thailand, Laos and Vietnam, and to India, where he studies a stone inscription found at the Chatturbhuja temple in Gwalior, which has been dated to 876 AD, and was long thought to be the planet’s first zero.
Eventually, his research leads him to Cambodia in search of an inscription originally discovered by French scholar George Coedès in 1929. The stone was found among the ruins of a 7th century temple at Sambor-on-Mekong, in present-day Kratie.
The key phrase on the stone is a date marker: “The Chaka era reached 605 in the year of the waning moon,” and can be dated to 683 AD, making it a full two centuries earlier than the Indian zero.
The stone with the zero (actually a dot), known prosaically as K-127, was moved to Phnom Penh, then to Siem Reap in 1969, when it disappeared. Many thought that the Khmer Rouge, with their hatred of culture and learning, had destroyed it. Nevertheless, Aczel tracks it down, rescues it from some entertainingly stupid Italian archaeologists, and sees it consigned, quite properly, to the National Museum (it’s worth noting, however, that K-127 is not on display at the museum, and no one seems to have much of a clue about where it is).
Along the way, Aczel considers whether Buddhist philosophical concepts about being and nothingness could have pushed Eastern thinkers toward developing the concept of zero, and decides that they probably did.
Aczel has written some 20 books on scientific topics. His prolific output and range of interests might make some of his mistakes in Finding Zero forgivable (e.g. Siem Reap does not translate to “Siam Victorious,” and you’d have had to have lived a very sheltered life to describe Phnom Penh’s National Museum as “one of the finest museums in the world”). And while it’s unlikely that “writes like a mathematician” will ever be seen as a huge compliment, Finding Zero is at least clearly and efficiently written, and tells an engaging and important story.

03 July 2015

Women solving challenges


Here are links to two interesting and inspiring videos on women in chess, and a link to scholarships for women from Africa and developing countries:


http://www.bbc.com/news/world-africa-21351317
Making the move from slum child to chess champion
6 February 2013 
Phiona Mutesi grew up in the slums of Kampala, Uganda, but her life changed when she walked into a local chess club which was offering free meals. Ten years on she is one of the best young female chess players in Africa.


http://www.bbc.com/news/world-us-canada-32618139
Girl, 11, is youngest US chess master
12 May 2015
For centuries chess has been a small game for big thinkers. Each player gets just 16 pieces to play on a board with 64 little squares. The numbers start out simple, but the outcomes are virtually limitless. No surprise, then, that it takes most people years to get any good - let alone become what the chess world calls a "master". Well, Carissa Yip has upended all those assumptions - for a start she's only 11 years old, and the word master doesn't really fit.


-----

2016 List of Scholarships for African women and Developing Countries

http://www.afterschoolafrica.com/255/international-scholarship-for-women-in/


12 June 2015

The Future of Computers Is the Mind of a Toddler

Facebook and Google are trying to create artificial intelligence that mimics the human brain. First, they need to figure out how our own minds work
June 9, 2015

Machines contain the breadth of human knowledge, yet they have the common sense of a newborn. The problem is that computers don't act enough like toddlers. Yann LeCun, director of artificial intelligence research at Facebook, demonstrates this by standing a pen on the table and then holding his phone in front of it. He performs a sleight of hand, and when he picks the phone up—ta-da! The pen is gone. It’s a trick that’ll elicit a gasp from any one-year-old child, but today's cutting-edge artificial intelligence software—and most months-old babies—can’t appreciate that the disappearing act isn’t normal. “Before they’re a few months old, you play this trick on them, and they don’t care,” says LeCun, a 54-year-old father of three. “After a few months, they figure out this is not normal.”

One reason to love computers is that, unlike many kids, they do as they’re told. Just about everything a computer is capable of was put there by a person, and they've rarely been able to discover new techniques or learn on their own. Instead, computers rely on scenarios created by software programmers: If this happens, then do that. Unless it's explicitly told that pens aren't supposed to disappear into thin air, a computer just goes with it. The big piece missing in the crusade for the thinking machine is to give computers a memory that works like the grey gunk in our own heads. An AI with something resembling brain memory would be able to discern the highlights of what it sees, and use the information to shape its understanding of things over time. To do that, the world's top researchers are rethinking how machines store information, and they're turning to neuroscience for inspiration.

This change in thinking has spurred an AI arms race among technology companies such as Facebook, Google, and China’s Baidu. They’re spending billions of dollars to create machines that may one day possess common sense and to help create software that responds more naturally to users’ requests and requires less hand-holding. A facsimile of biological memory, the theory goes, should let AI not only spot patterns in the world, but reason about them with the logic we associate with young children. They’re doing this by pairing brain-aping bits of software, known as neural networks, with the ability to store longer sequences of information, inspired by the long-term memory component of our brain called the hippocampus. This combination allows for an implicit understanding of the world to get “fried in” to the patterns computers detect from moment to moment, says Jason Weston, an AI researcher at Facebook. On June 9, Facebook plans to publish a research paper detailing a system that can chew through several million pieces of data, remember the key points, and answer complicated questions about them. A system like this might let a person one day ask Facebook to find photos of themselves wearing pink at a friend's birthday party, or ask broader, fuzzier questions, like whether they seemed happier than usual last year, or appeared to spend more time with friends.
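
As a rough illustration of the "attention over a stored memory" idea described above, here is a toy Python sketch (not Facebook's actual Memory Networks system, just an assumed bag-of-words set-up) in which a question is scored against a memory of stored facts and the best-matching fact drives the answer.

import numpy as np

# Tiny vocabulary and bag-of-words embedding (purely illustrative)
vocab = ["john", "went", "to", "the", "kitchen", "mary", "garden", "where", "is"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def embed(sentence):
    """Bag-of-words vector for a whitespace-tokenised sentence."""
    v = np.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in word_to_idx:
            v[word_to_idx[w]] += 1.0
    return v

# "Memory": a list of stored facts, each embedded as a vector
facts = ["john went to the kitchen", "mary went to the garden"]
memory = np.stack([embed(f) for f in facts])

# Query: attend over the memory with a softmax of dot-product scores
query = embed("where is mary")
scores = memory @ query
attention = np.exp(scores) / np.exp(scores).sum()

# The most attended fact is used to support the answer
best = int(np.argmax(attention))
print("attention weights:", np.round(attention, 2))
print("supporting fact:  ", facts[best])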

While AI has long been an area of interest for Hollywood and novelists, companies hadn't paid much attention to it until about five years ago. That's when research institutions and academics, aided by new techniques for crunching reams of data, started breaking records in speech recognition and image analysis at an unexpected rate. Venture capitalists took notice and invested $309.2 million in AI startups last year, a twentyfold increase from 2010, according to research firm CB Insights. Some of these startups are helping to break new ground. One in Silicon Valley, called MetaMind, has developed improvements to computers' understanding of everyday speech. Clarifai, an AI startup in New York, is doing complex video analysis and selling the service to businesses.
Corporate research labs now rival those in academia in terms of staffing and funding. They have surpassed them in access to proprietary data and computing power to run experiments on. That's attracting some of the field's most prominent researchers. LeCun, former director of New York University's Center for Data Science, joined Facebook in December 2013 to run its AI group. While still teaching a day a week at NYU, he has hired nearly 50 researchers; on June 2, Facebook said it is opening an AI lab in Paris, its third such facility. Google says its own AI team numbers in the “hundreds,” declining to be more specific. Baidu's Silicon Valley AI lab opened in May 2014, and now has around 25 researchers led by Andrew Ng, a former AI head at Google. The Chinese search giant employs about 200 AI specialists globally. The interest from deep-pocketed consumer Internet companies kickstarted a research boom creating “one of the biggest advances” in decades, says Bruno Olshausen, head of the Redwood Center for Theoretical Neuroscience at the University of California-Berkeley. “The work going on in these labs is unprecedented in the novelty of the research, the pioneering aspect.”

As far as tech money has pushed AI in recent years, computers are still pretty dumb. When talking to friends in a loud bar, you pick up what they're saying, based on context and what you remember about their interests, even if you can't hear every word. Computers can't do that. “Memory is central to cognition,” says Olshausen. The human brain doesn't store a complete log of each day's events; it compiles a summation and bubbles up the highlights when relevant, he says. Or at least, that's what scientists think. The problem with trying to create AI in our own image is that we don't fully comprehend how our minds work. “From a neuroscience perspective, where we are in terms of our understanding of the brain—and what it takes to build an intelligent system—is kind of pre-Newton,” Olshausen says. “If you're in physics and pre-Newtonian, you're not even close to building a rocket ship.”

Modern AI systems analyze images, transcribe texts, and translate languages using a system called neural networks, inspired by the brain's neocortex. Over the past year, virtually the entire AI community has begun shifting to a new approach to solve tough-to-crack problems: adding a memory component to the neuron jumble. Each company uses a different technique to accomplish this, but they share the same emphasis on memory. The speed of this change has taken some experts by surprise. “Just a few months ago, we thought we were the only people doing something a bit like that," says Weston, who co-authored Facebook's first major journal article about memory-based AI last fall. Days later, a similar paper appeared from researchers at Google DeepMind.
Since then, AI equipped with a sort of short-term memory has helped Google set records in video and image analysis, as well as create a machine that can figure out how to play video games without instructions. (They share more in common with kids than you probably thought.) Baidu has also made significant strides in image and speech recognition, including answering questions about images, such as, “What is in the center of the hand?” IBM says its Watson system can interpret conversations with an impressive 8 percent error rate. With Facebook's ongoing work on memory-based AI software that can read articles and then intelligently answer questions about their content, the social networking giant aims to create “a computer that can talk to you,” says Weston. The next step is to create an accompanying framework more akin to long-term memory, which could lead to machines capable of reasoning, he explains.

If a talking, learning, thinking machine sounds a little terrifying to you, you're not alone. “With artificial intelligence, we're summoning the demon,” Elon Musk said last year. The chief executive officer of Tesla Motors has a team working on AI that will let its electric cars drive themselves. Musk is also an investor in an AI startup named Vicarious. After some apparent self-reflection, Musk donated $10 million to the Future of Life Institute, an organization set up by Massachusetts Institute of Technology Professor Max Tegmark and his wife Meia to spur discussions about the possibilities and risks associated with AI. The organization brings together the world's top academics, researchers, and experts in economics, law, ethics, and AI to discuss how to develop brainy computers that will give us a future bearing more resemblance to The Jetsons than to Terminator. “Little serious research has been devoted to the issues, outside of a few small nonprofit institutes. Fortunately, this is now changing,” Stephen Hawking, who serves on the institute's advisory board, said at a Google event in May. “The healthy culture of risk assessment and the awareness of societal implications is beginning to take root in the AI community.”

AI teams from competing companies are working together to advance research, with an eye toward doing so in a responsible way. The field still operates with an academic fervor reminiscent of the early days of the semiconductor industry—sharing ideas, collaborating on experiments, and publishing peer-reviewed papers. Google and Facebook are developing parallel research schemes focused on memory-based AI, and they're publishing their papers to free academic repositories. Google's cutting-edge Neural Turing Machine “can learn really complicated programs” without direction and can operate them pretty well, Peter Norvig, a director of research at Google, said in a talk in March. Like a cubicle dweller wrestling with Excel, the machine makes occasional mistakes. And that's all right, says Norvig. “It's like a dog that walks on its rear legs. Can you do it at all? That's the exciting thing.”

Technological progress within corporate AI labs has begun to make its way back to universities. Students at Stanford University and other schools have built versions of Google's AI systems and published the source code online for anyone to use or modify. Facebook is fielding similar interest from academics. Weston delivered a lecture at Stanford on May 11 to more than 70 attending students—with many more tuning in online—who were interested in learning more about Facebook's Memory Networks project. LeCun, the Facebook AI boss, says, “We see ourselves as not having a monopoly on good ideas.” LeCun co-wrote a paper in the science journal Nature on May 28, along with Google's Geoff Hinton and University of Montreal Professor Yoshua Bengio, saying memory systems are key to giving computers the ability to reason about the world by adjusting their understanding as they see things change over time.

To illustrate how people use memory to respond to an event, LeCun grabs his magic pen from earlier and tosses it at a colleague. A machine without a memory or an understanding of time doesn't know how to predict where an object will land or how to respond to it. People, on the other hand, use memory and common sense to almost instinctively catch or get out of the way of what's coming at them. Weston, the Facebook AI researcher, watches the pen arc through the air and then hit him in the arm. “He's terrible at catching stuff,” LeCun says, with a laugh, “but he can predict!” Weston reassures him: “I knew it was going to hit me.”


29 May 2015

on screen

 I was recently roped in to participate in a teenage-audience programme on studying science and women in STEM.  YOTV rolled out the red lipstick and mascara for a novel studio experience. The Q&A was slick and over so quickly. Here are the 5 segments which made up the episode screened on SABC1 on 26 May 2015 - I appear in segments YOTV2 and YOTV4. 

18 May 2015

from rayli.net: Top 10 data mining algorithms in plain English


http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/

Top 10 Data Mining Algorithms
Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.
Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.
What are we waiting for? Let’s get started!
Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I’ve incorporated into the post.

1. C4.5

What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.
Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.
What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2max, family history, etc. These are called attributes.
Now:
Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.
And here’s the deal:
Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.
Cool, so what’s a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:
  • Patient has a history of cancer
  • Patient is expressing a gene highly correlated with cancer patients
  • Patient has tumors
  • Patient’s tumor size is greater than 5cm
The bottomline is:
At each point in the flowchart is a question about the value of some attribute, and depending on those values, he or she gets classified. You can find lots of examples of decision trees.
Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.
You might be wondering how C4.5 is different from other decision tree systems.
  • First, C4.5 uses information gain when generating the decision tree.
  • Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
  • Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.
  • Finally, incomplete data is dealt with in its own ways.
Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.
Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.
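
As a hedged, minimal sketch of decision-tree classification in Python: note that scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, but choosing the "entropy" criterion gives the same information-gain flavour, and the printed tree is the flowchart-like set of questions described above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# A labelled medical dataset: numeric attributes per patient, plus a class label
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# criterion="entropy" uses information gain, the idea C4.5 is built on
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", round(tree.score(X_test, y_test), 3))
# Print the learned tree as a readable flowchart of attribute questions
print(export_text(tree, feature_names=[str(f) for f in data.feature_names]))
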
Classifiers are great, but make sure to checkout the next algorithm about clustering…

2. k-means

What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to one another than to objects in other groups. It's a popular cluster analysis technique for exploring a dataset.
Hang on, what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such that the group members are more similar versus non-group members. Clusters and groups are synonymous in the world of cluster analysis.
Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster analysis, these would be called observations. We know various things about each patient like age, pulse, blood pressure, VO2max, cholesterol, etc. This is a vector representing the patient.
Look:
You can basically think of a vector as a list of numbers we know about the patient. This list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood pressure another dimension and so forth.
You might be wondering:
Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc?
Want to know the best part?
You tell k-means how many clusters you want. K-means takes care of the rest.
How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.
At a high level, they all do something like this:
  1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
  2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
  3. What we have are k clusters, and each patient is now a member of a cluster.
  4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
  5. This center becomes the new centroid for the cluster.
  6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
  7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
Is this supervised or unsupervised? It depends, but most would classify k-means as unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to. k-means can be semi-supervised.
Why use k-means? I don’t think many will have an issue with this:
The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.
It gets better:
k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k  and explore whether there are overlooked patterns or relationships in the dataset.
It’s not all smooth sailing:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.
Where is it used? A ton of implementations for k-means clustering are available online:
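
Here is a minimal scikit-learn sketch with made-up "patient" vectors (the three columns stand in for age, pulse and blood pressure; the values are arbitrary, purely for illustration).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two loose groups of simulated patients
rng = np.random.default_rng(0)
patients = np.vstack([
    rng.normal([30, 70, 115], 5, size=(50, 3)),
    rng.normal([60, 85, 140], 5, size=(50, 3)),
])

# Standardise so no single measurement dominates the distance calculation
X = StandardScaler().fit_transform(patients)

# You choose k; k-means handles the centroid updates until convergence
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print("centroids (in standardised units):")
print(km.cluster_centers_)
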
If decision trees and clustering didn’t impress you, you’re going to love the next algorithm…

3. Support vector machines

What does it do? Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except SVM doesn't use decision trees at all.
Whoa, a hyper-what? A hyperplane is a function like the equation for a line, y = mx + b. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.
As it turns out…
SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…
…SVM figures out the best hyperplane which separates your data into the 2 classes.
Do you have an example? Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.
You see:
When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.
What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.
And the coolest part?
SVM figures out the function for the hyperplane.
What if things get more complicated? Right, they frequently do. If the balls are mixed together, a straight stick won’t work.
Here’s the work-around:
Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air.
You might be wondering if this is cheating:
Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.
How does SVM do this? By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.
I found this visualization super helpful:
Reddit also has 2 great threads on this in the ELI5 and ML subreddits.
How do balls on a table or in the air map to real-life data? A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.
Here’s the deal:
If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.
The bottomline is:
SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.
Margins are often associated with SVM. What are they? The margin is the distance between the hyperplane and the 2 closest data points from each respective class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.
The key is:
SVM attempts to maximize the margin, so that the hyperplane is just as far away from red ball as the blue ball. In this way, it decreases the chance of misclassification.
Where does SVM get its name from? Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.
Is this supervised or unsupervised? This is supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.
Why use SVM? SVM along with C4.5 are generally the 2 classifiers to try first. No classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses.
Where is it used? There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.
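
Here is a minimal scikit-learn sketch of the "lift the table" intuition: two classes arranged in concentric circles, which a linear hyperplane cannot separate but an RBF kernel can. The dataset is synthetic, purely for illustration.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes that no straight line can separate in 2-D (the "mixed balls")
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear "stick" struggles...
linear = SVC(kernel="linear").fit(X_train, y_train)

# ...while an RBF kernel implicitly "lifts the table" into higher dimensions
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", round(linear.score(X_test, y_test), 3))
print("RBF kernel accuracy:   ", round(rbf.score(X_test, y_test), 3))
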
The next algorithm is one of my favorites…

4. Apriori

What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.
What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database.
What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.
[image: shopping database]
Here’s the best part:
By applying the Apriori algorithm, we can learn the grocery items that are purchased together, a.k.a. association rules.
The power of this is:
You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.
For example:
You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!
You might be wondering how Apriori works? Before getting into the nitty gritty of algorithm, you’ll need to define 3 things:
  1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
  2. The second is your support or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support is called a frequent itemset.
  3. The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.
The basic Apriori algorithm is a 3 step approach:
  1. Join. Scan the whole database for how frequent 1-itemsets are.
  2. Prune. Those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.
  3. Repeat. This is repeated for each itemset level until we reach our previously defined size.
Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.
But wait, there’s more…
Apriori can also be modified to do classification based on labelled data.
Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.
On the other hand…
The algorithm can be quite memory, space and time intensive when generating itemsets.
Where is it used? Plenty of implementations of Apriori are available. Some popular ones are ARtool, Weka, and Orange.
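
Rather than any particular library, here is a hedged, from-scratch Python sketch of the support/confidence bookkeeping for 1- and 2-itemsets on a made-up basket dataset. Real Apriori implementations add candidate generation and pruning for larger itemsets.

from itertools import combinations

# Toy supermarket transactions (each row is one customer's basket)
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "butter"},
    {"chips", "soda", "bread"},
]
min_support = 0.4   # itemset must appear in at least 40% of transactions

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Pass 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Pass 2: candidate 2-itemsets are built only from frequent 1-itemsets (the prune step)
pairs = {a | b for a, b in combinations(frequent, 2)}
frequent_pairs = [p for p in pairs if support(p) >= min_support]

for p in frequent_pairs:
    x, y = sorted(p)
    print(sorted(p), "support =", support(p))
    # Confidence of the rule {x} -> {y}
    print(f"  conf({x} -> {y}) =", round(support(p) / support(frozenset([x])), 2))
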
The next algorithm was the most difficult for me to understand. Take a look…

5. EM

What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.
In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.
OK, hang on while I explain…
I’m not a statistician, so hopefully my simplification is both correct and helps with understanding.
Here are a few concepts that will make this way easier…
What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.
Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.
In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.
Cool, what are the parameters of  a model? A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.
Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.
So, all you need to describe a normal distribution are 2 parameters:
  1. The mean
  2. The variance
And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample.
Here’s the deal:
We don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with estimated mean and variance results in those bunch of grades.
In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.
Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade.
You’re probably wondering, what’s probability then?
Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.
In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us.
Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There a number of reasons that the data could be missing (not recorded, ignored, etc.).
Here’s the kicker:
For data mining and clustering, what’s important to us is looking at the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.
Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.
The best part is…
By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me!
How does EM help with clustering? EM begins by making a guess at the model parameters.
Then it follows an iterative 3-step process:
  1. E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
  2. M-step: Update the model parameters based on the cluster assignments from the E-step.
  3. Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.
Why use EM? A key selling point of EM is it’s simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.
This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to.
EM is not without weaknesses though…
  • First, EM is fast in the early iterations, but slow in the later iterations.
  • Second, EM doesn’t always find the optimal parameters and gets stuck in local optima rather than global optima.
Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its gmm module.
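
As a minimal sketch, here is EM-based clustering of one-dimensional "grades" with scikit-learn. Recent scikit-learn versions expose this as GaussianMixture in sklearn.mixture (the older gmm module mentioned above has been replaced); the data below are simulated purely for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two "exam-grade" populations generated from different bell curves
rng = np.random.default_rng(0)
grades = np.concatenate([
    rng.normal(65, 8, 300),
    rng.normal(85, 5, 300),
]).reshape(-1, 1)

# EM alternates E-steps (soft cluster assignments) and M-steps (parameter updates)
gmm = GaussianMixture(n_components=2, random_state=0).fit(grades)

print("estimated means:    ", gmm.means_.ravel().round(1))
print("estimated variances:", gmm.covariances_.ravel().round(1))
print("mixing weights:     ", gmm.weights_.round(2))

# Soft assignment (responsibility) of a single new grade to each cluster
print("P(cluster | grade=75):", gmm.predict_proba([[75.0]]).round(2))
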
What data mining does Google do? Take a look…

6. PageRank

What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.
Yikes.. what’s link analysis? It’s a type of network analysis looking to explore the associations (a.k.a. links) among objects.
Here’s an example: The most prevalent example of PageRank is Google’s search engine. Although their search engine doesn’t solely rely on PageRank, it’s one of the measures Google uses to determine a web page’s importance.
Let me explain:
Web pages on the World Wide Web link to each other. If rayli.net links to a web page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant.
And it doesn’t stop there…
rayli.net’s votes are in turn weighted by rayli.net’s importance and relevance. In other words, any web page that’s voted for rayli.net increases rayli.net’s relevance.
The bottom line?
This concept of voting and relevance is PageRank. rayli.net’s vote for CNN increases CNN’s PageRank, and the strength of rayli.net’s PageRank influences how much its vote affects CNN’s PageRank.
What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a PageRank number isn’t disclosed by Google, we can get a sense of its relative meaning.
And here’s how:
[image: PageRank table]
You see?
It’s a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.
What other applications are there of PageRank? PageRank was specifically designed for the World Wide Web.
Think about it:
At its core, PageRank is really just a super effective way to do link analysis. The objects being linked don’t have to be web pages.
Here are 3 innovative applications of PageRank:
  1. Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
  2. Twitter developed WTF (Who-to-Follow) which is a personalized PageRank recommendation engine about who to follow.
  3. Bin Jiang, from The Hong Kong Polytechnic University, used a variant of PageRank to predict human movement rates based on topographical metrics in London.
Is this supervised or unsupervised? PageRank is generally considered an unsupervised learning approach, since it’s often used to discover the importance or relevance of a web page.
Why use PageRank? Arguably, the main selling point of PageRank  is its robustness due to the difficulty of getting a relevant incoming link.
Simply stated:
If you have a graph or network and want to understand relative importance, priority, ranking or relevance, give PageRank a try.
Where is it used? The PageRank trademark is owned by Google. However, the PageRank algorithm is actually patented by Stanford University.
You might be wondering if you can use PageRank:
I’m not a lawyer, so best to check with an actual lawyer, but you can probably use the algorithm as long as it doesn’t commercially compete against Google/Stanford.
Here are 3 implementations of PageRank:
  1. C++ OpenSource PageRank Implementation
  2. Python PageRank Implementation
  3. igraph – The network analysis package (R)
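
For a sense of how simple the core computation is, here is a hedged power-iteration sketch of PageRank in plain Python/NumPy on a made-up three-page web. Real implementations such as those linked above add sparse data structures and many refinements.

import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on an adjacency matrix (entry [i, j] = 1 if i links to j)."""
    n = adjacency.shape[0]
    out = adjacency.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling pages link to everyone uniformly
    transition = np.where(out > 0, adjacency / np.where(out == 0, 1, out), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * rank @ transition
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Tiny web: pages 0 and 1 link to each other, page 2 links only to page 0
links = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)
print(pagerank(links).round(3))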

7. AdaBoost

What does it do? AdaBoost is a boosting algorithm which constructs a classifier.
As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element  belongs to.
But what’s boosting? Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.
What’s the difference between a strong and weak learner? A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.
Alternatively…
A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.
What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing patient data. The dataset contains details about the patient’s medical records.
The question is…
How can we predict whether the patient will get cancer?
Here’s how AdaBoost answers the question…
In round 1: AdaBoost takes a sample of the training dataset and tests to see how accurate each learner is. The end result is we find the best learner.
In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.
One more thing, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).
In round 2: AdaBoost again attempts to look for the best learner.
And here’s the kicker:
The sample of patient training data is now influenced by the more heavily misclassified weights. In other words, previously misclassified patients have a higher chance of showing up in the sample.
Why?
It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.
Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let’s focus all the efforts on getting the misclassified patients.
The best learner is again weighted and incorporated into the ensemble, misclassified patients are weighted so they have a higher chance of being picked and we rinse and repeat.
At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weaker learners with the labelled dataset.
Why use AdaBoost? AdaBoost is simple. The algorithm is relatively straight-forward to program.
In addition, it’s fast! Weak learners are generally simpler than strong learners. Being simpler means they’ll likely execute faster.
Another thing…
It’s a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.
Finally, it’s flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.
Where is it used? AdaBoost has a ton of implementations and variants. Here are a few:
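
Here is one minimal scikit-learn sketch comparing a single decision stump with a boosted ensemble of stumps; the dataset choice is arbitrary, purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single weak learner: a decision stump (one-level decision tree)
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("single stump accuracy:   ", round(stump.score(X_test, y_test), 3))

# 50 boosting rounds; scikit-learn's default base learner is itself a depth-1 stump,
# and each round re-weights the training samples the previous rounds got wrong
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("boosted ensemble accuracy:", round(boosted.score(X_test, y_test), 3))
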
If you like Mr. Rogers, you’ll like the next algorithm…

8. kNN

What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.
What’s a lazy learner? A lazy learner doesn’t do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify.
On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.
How do C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.
Here’s why:
  1. C4.5 builds a decision tree classification model during training.
  2. SVM builds a hyperplane classification model during training.
  3. AdaBoost builds an ensemble classification model during training.
So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data.
When new unlabeled data comes in, kNN operates in 2 basic steps:
  1. First, it looks at the k closest labeled training data points — in other words, the k-nearest neighbors.
  2. Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.
You might be wondering…
How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There’s tons more details and papers on kNN distance metrics.
For discrete data, the idea is to transform discrete data into continuous data. 2 examples of this are:
  1. Using Hamming distance as a metric for the “closeness” of 2 text strings.
  2. Transforming discrete data into binary features.
These 2 Stack Overflow threads have some more suggestions on dealing with discrete data:
How does kNN classify new data when neighbors disagree? kNN has an easy time when all neighbors are the same class. The intuition is if all the neighbors agree, then the new data point likely falls in the same class.
I’ll bet you can guess where things get hairy…
How does kNN decide the class when neighbors don’t have the same class?
2 common techniques for dealing with this are:
  1. Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
  2. Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly what we want!
Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.
Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.
But that’s just part of the story…
Here are 5 things to watch out for:
  1. kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
  2. Noisy data can throw off kNN classifications.
  3. Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.
  4. Since processing is deferred until classification time, kNN generally has greater storage requirements than eager classifiers.
  5. Selecting a good distance metric is crucial to kNN’s accuracy.
Where is it used? A number of kNN implementations exist, including scikit-learn’s KNeighborsClassifier, Weka’s IBk and the knn function in R’s class package.
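As one concrete, hedged example using scikit-learn's KNeighborsClassifier, the sketch below also scales the features first, which addresses the range caveat above; the dataset is an arbitrary built-in placeholder:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling first keeps large-range features from dominating the distance metric;
# weights="distance" gives closer neighbors a bigger say, as described above.
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5, weights="distance"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))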
Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm…

9. Naive Bayes

What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:
Every feature of the data being classified is independent of all other features given the class.
What does independent mean? 2 features are independent when the value of one feature has no effect on the value of another feature.
For example:
Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if no feature’s value had any effect on another’s. For this dataset, it’s reasonable to assume that the patient’s height and zip code are independent, since a patient’s height has little to do with their zip code.
But let’s not stop there, are the other features independent?
Sadly, the answer is no. Here are 3 feature relationships which are not independent:
  • If height increases, weight likely increases.
  • If cholesterol level increases, weight likely increases.
  • If cholesterol level increases, pulse likely increases as well.
In my experience, the features of a dataset are generally not all independent.
And that ties in with the next question…
Why is it called naive? The assumption that all features of a dataset are independent is precisely why it’s called naive — it’s generally not the case that all features are independent.
What’s Bayes? Thomas Bayes was an English statistician after whom Bayes’ Theorem is named. You can click on the link to find out more about Bayes’ Theorem.
In a nutshell, the theorem allows us to predict the class given a set of features using probability.
The simplified equation for classification looks something like this:
P(Class A|Feature 1, Feature 2) = [P(Feature 1|Class A) × P(Feature 2|Class A) × P(Class A)] / [P(Feature 1) × P(Feature 2)]
Let’s dig deeper into this…
What does the equation mean? The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.
The equation reads: The probability of Class A given Features 1 and 2 is a fraction.
  • The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
  • The fraction’s denominator is the probability of Feature 1 multiplied by the probability of Feature 2.
What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow thread.
Here’s the deal:
  • We have a training dataset of 1,000 fruits.
  • The fruit can be a Banana, Orange or Other (these are the classes).
  • The fruit can be  Long, Sweet or Yellow (these are the features).
Fruit counts by class and feature:

            Long   Sweet   Yellow   Total
  Banana     400     350      450     500
  Orange       0     150      300     300
  Other      100     150       50     200
  Total      500     650      800    1000
What do you see in this training dataset?
  • Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
  • Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
  • Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.
Suppose we are told the unknown fruit is long, sweet and yellow.
Here’s how we calculate all the probabilities in 4 steps:
Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or more succinctly:
P(Banana|Long, Sweet, Yellow)
This is exactly like the equation discussed earlier.
Step 2: Starting with the numerator, let’s plug everything in.
  • P(Long|Banana) = 400/500 = 0.8
  • P(Sweet|Banana) = 350/500 = 0.7
  • P(Yellow|Banana) = 450/500 = 0.9
  • P(Banana) = 500 / 1000 = 0.5
Multiplying everything together (as in the equation), we get:
0.8 × 0.7 × 0.9 × 0.5 = 0.252
Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.
Step 4: Do a similar calculation for the other classes:
  • P(Orange|Long, Sweet, Yellow) = 0, because P(Long|Orange) = 0/300 = 0 zeroes out the whole product.
  • P(Other|Long, Sweet, Yellow) = 0.5 × 0.75 × 0.25 × 0.2 = 0.01875
Since 0.252 is greater than 0.01875, Naive Bayes would classify this long, sweet and yellow fruit as a banana.
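To tie the arithmetic together, here is a small plain-Python sketch that reproduces the calculation from the fruit counts above (the printed values match up to floating-point rounding):

# Counts per class: total fruits, and how many are Long / Sweet / Yellow.
counts = {
    "Banana": {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"total": 300, "Long": 0,   "Sweet": 150, "Yellow": 300},
    "Other":  {"total": 200, "Long": 100, "Sweet": 150, "Yellow": 50},
}
n_fruits = 1000

def score(cls, features):
    # Numerator of the Naive Bayes equation: P(class) times the product of P(feature|class).
    s = counts[cls]["total"] / n_fruits  # the class prior, P(class)
    for f in features:
        s *= counts[cls][f] / counts[cls]["total"]
    return s

features = ["Long", "Sweet", "Yellow"]
scores = {cls: score(cls, features) for cls in counts}
print(scores)                       # roughly {'Banana': 0.252, 'Orange': 0.0, 'Other': 0.01875}
print(max(scores, key=scores.get))  # Banana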
Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.
Why use Naive Bayes? As you can see in the example above, Naive Bayes involves simple arithmetic: it’s just tallying up counts, multiplying and dividing.
Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to be effective for spam filtering.
Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.
Finally, check out the 10th algorithm…

10. CART

What does it do? CART stands for classification and regression trees.  It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.
Is a classification tree like a decision tree? A classification tree is a type of decision tree. The output of a classification tree is a class.
For example, given a patient dataset, you might attempt to predict whether the patient will get cancer. The class would either be “will get cancer” or “won’t get cancer.”
What’s a regression tree? Unlike a classification tree which predicts a class, regression trees predict a numeric or continuous value e.g. a patient’s length of stay or the price of a smartphone.
Here’s an easy way to remember…
Classification trees output classes, regression trees output numbers.
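As a hedged illustration using scikit-learn, which (as noted in the implementations below) provides CART-style trees, the two kinds of tree differ only in what they predict; the datasets are arbitrary built-ins:

from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predicts a class label.
X_c, y_c = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_c, y_c)
print(clf.predict(X_c[:1]))  # a class label, e.g. 0 or 1

# Regression tree: predicts a number.
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_r, y_r)
print(reg.predict(X_r[:1]))  # a continuous value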
Since we’ve already covered how decision trees are used to classify data, let’s jump right into things…
How does this compare with C4.5?
  • Splitting criterion: C4.5 uses information gain to segment data during decision tree generation, while CART uses Gini impurity (not to be confused with the Gini coefficient; a good discussion of the difference is available on Stack Overflow).
  • Pruning: C4.5 uses a single-pass pruning process to mitigate over-fitting, while CART uses cost-complexity pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with and without each node, and prunes the node away when removing it doesn’t raise the cost beyond a threshold.
  • Branching: C4.5’s decision nodes can have 2 or more branches, while CART’s decision nodes always have exactly 2 branches (binary splits).
  • Missing values: C4.5 probabilistically distributes missing values to children, while CART uses surrogate splits to send records with missing values to children.
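To make the Gini impurity point concrete, here is a tiny generic sketch of the formula (1 minus the sum of squared class proportions at a node), not any library's internals:

def gini_impurity(class_counts):
    # Gini impurity = 1 - sum of squared class proportions at a node.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini_impurity([50, 50]))  # 0.5, a maximally mixed two-class node
print(gini_impurity([100, 0]))  # 0.0, a pure node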
Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.
Why use CART? Many of the reasons you’d use C4.5 also apply to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation also apply to CART as well.
Like C4.5, CART is also quite fast, quite popular, and its output is human readable.
Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.

Interesting Resources

Now it’s your turn…

Now that I’ve shared my thoughts and research around these data mining algorithms, I want to turn it over to you.
  • Are you going to give data mining a try?
  • Which data mining algorithms have you heard of but weren’t on the list?
  • Or maybe you have a question about an algorithm?
Let me know what you think by leaving a comment below right now.