Advances in Financial Machine Learning


Two of the most talked-about topics in modern finance are machine learning and quantitative finance. Both of these are addressed in a new book, written by noted financial scholar Marcos Lopez de Prado, entitled Advances in Financial Machine Learning.

In this book, Lopez de Prado strikes a well-aimed karate chop at the naive and often statistically overfit techniques that are so prevalent in the financial world today. But Lopez de Prado does more than just expose the mathematical and statistical sins of the field. He also presents a technically sound roadmap for those who want to do state-of-the-art work in the field.

One particularly refreshing feature is that, unlike many other treatments, this book emphasizes real-world empirical data analysis rather than theoretical approaches that look pretty on paper but are often ineffective in practice. Another is its openness. As many readers will attest, the field of real-world quantitative finance is plagued by “knowledge hoarding.” In contrast, this work is completely open, offering considerable nuts-and-bolts instruction on how to implement truly effective techniques.

Overview of technical material

After a very nicely written introduction, Lopez de Prado presents, one by one, many of the techniques involved in real-world financial machine learning. Here is a short summary of the technical material:

  1. Data structures: Different datatypes, basic analytics, weights and sampling, labeling and fractionally differentiated features.
  2. Modeling: Error types, bagging versus boosting, cross-validation, feature extraction and hyper-parameter tuning.
  3. Backtesting: Dangers of backtesting, backtesting through cross-validation, backtesting on synthetic data, backtest statistics, strategy risk and machine learning asset allocation.
  4. Useful financial features: Structural breaks, entropy features and microstructural features.
  5. High-performance computing: Parallelization, single-thread versus multithreading versus multiprocessing, quantum computing and high-performance computational intelligence and forecasting technologies (the last chapter in this section is co-authored by Kesheng Wu and Horst Simon of the Lawrence Berkeley National Laboratory).

Other interesting commentary

The introductory section includes some very interesting insights. For example, Lopez de Prado recalls meetings he has attended with discretionary portfolio managers. Such gatherings typically proceed with each manager obsessively discussing one particular technique or item of anecdotal information; indeed, this “silo” approach is by design, since general managers typically want numerous different insights rather than a team consensus. However, this approach typically leads to “disaster” when applied to state-of-the-art quantitative and/or machine learning projects.

Instead, Lopez de Prado suggests that such work follow the example of large government laboratories, such as the Lawrence Berkeley National Laboratory in the U.S. and its equivalents elsewhere. In such organizations, theoretical researchers work with applied researchers and mathematicians, who in turn work with experimental scientists who gather data at state-of-the-art experimental facilities, and all of the above work with computer scientists to analyze data and run simulations on large high-performance computer systems. The author asks which has a higher chance of success, “this proven paradigm of organized collaboration” or “the Sisyphean alternative of having every single quant rolling their immense boulder up the mountain?”


It should be emphasized that this book is not for beginners, and is definitely not bedside reading. To begin with, the book presumes a fairly strong familiarity with modern statistical techniques. In addition, as the author states clearly in the introduction, the exposition presumes that the reader is familiar with the Python programming language, as well as at least some of the common statistical and machine learning packages, such as scikit-learn (sklearn), pandas, numpy, scipy, multiprocessing and matplotlib. Code examples are presented throughout the book, and it is presumed that the reader will try to implement at least a few of these examples.

In this sense, the book is more of a workbook than a scholarly treatise, in that it does not attempt to break new ground with revolutionary techniques, but instead presents techniques already in the literature and focuses on how they can be directly applied to real-world finance work. The book is appropriate either for quantitative professionals who wish to learn effective techniques for machine learning, or as a textbook for classroom instruction.

The book underscores the fact that, one way or the other, the finance world is moving in highly mathematical, highly statistical and highly compute-intensive directions. Let’s face it: just as the days when individual investors could beat the market by eyeballing charts are long gone, so the days when finance professionals could earn consistent, above-market-index results with relatively unsophisticated and statistically questionable techniques are also rapidly fading into the sunset.

Indeed, how can any individual or organization armed only with relatively modest techniques possibly compete with highly professional teams that draw on massive databases and real-time data such as satellite imagery and social media, and that apply very sophisticated mathematical, statistical and machine learning techniques running on state-of-the-art computer systems?

They can’t. So if you can’t fight them, by all means join them. This book will show the way.
