Mining of Massive Datasets

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to ‘train’ a machine-learning engine of some sort. The principal topics covered are:

  • Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
  • Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
  • Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
  • The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
  • Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
  • Algorithms for clustering very large, high-dimensional datasets.
  • Two key problems for Web applications: managing advertising and recommendation systems.
  • Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.
  • Techniques for obtaining the important properties of a large dataset by dimensionality reduction, including singular-value decomposition and latent semantic indexing.
  • Machine-learning algorithms that can be applied to very large data, such as perceptrons, support-vector machines, and gradient descent.
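To give a flavor of the similarity-search topic listed above, here is a minimal minhashing sketch. It is an illustration under stated assumptions, not code from the book: the function names, the 3-character shingle size, the CRC32 base hash, and the linear hash family `(a*x + b) mod p` are all choices made for this example.

```python
import random
import zlib

def shingles(text, k=3):
    """Return the set of k-character shingles of text (k=3 is an assumption)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def make_hashes(n, seed=0):
    """Build n hash functions of the form (a*x + b) mod p (illustrative family)."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a Mersenne prime larger than any CRC32 value
    return [(rng.randrange(1, p), rng.randrange(0, p), p) for _ in range(n)]

def minhash_signature(items, hashes):
    """For each hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot minhash an empty set")
    # CRC32 gives a deterministic integer per shingle (unlike built-in hash()).
    xs = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % p for x in xs) for a, b, p in hashes]

def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)

if __name__ == "__main__":
    hashes = make_hashes(200)
    s1 = shingles("mining of massive datasets")
    s2 = shingles("mining massive data sets")
    sig1 = minhash_signature(s1, hashes)
    sig2 = minhash_signature(s2, hashes)
    print("exact Jaccard:   ", jaccard(s1, s2))
    print("minhash estimate:", estimate_similarity(sig1, sig2))
```

The signatures are much smaller than the shingle sets they summarize, which is what makes this kind of comparison practical when the sets themselves do not fit in main memory. More hash functions give a tighter estimate at the cost of a longer signature.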

Mining of Massive Datasets

by Jure Leskovec, Anand Rajaraman, Jeff Ullman (PDF, PPT, Videos) – 12 chapters
