Data Science

125 books, 6 subcategories

Data Architectures

10 books

Data Visualization

6 books

Databases

16 books

Hadoop

4 books

Natural Language Processing

9 books

Spark

4 books

Semantic Modeling for Data

Avoiding Pitfalls and Breaking Dilemmas

by Panos Alexopoulos

What value does semantic data modeling offer? As an information architect or data science professional, let’s say you have an abundance of the right data and the technology to extract business gold—but you still fail. The reason? Bad data semantics.

In this practical and comprehensive field guide, author Panos Alexopoulos takes you on an eye-opening journey through semantic data modeling as applied in the real world. You’ll learn how to master this craft to increase the usability and value of your data and applications. You’ll also explore the pitfalls to avoid and dilemmas to overcome for building high-quality and valuable semantic representations of data.

Understand the fundamental concepts, phenomena, and processes related to semantic data modeling
Examine the quirks and challenges of semantic data modeling and learn how to effectively leverage the available frameworks and tools
Avoid mistakes and bad practices that can undermine your efforts to create good data models
Learn about model development dilemmas, including representation, expressiveness and content, development, and governance
Organize and execute semantic data initiatives in your organization, tackling technical, strategic, and organizational challenges

About the book

4.35/5 on Goodreads

ISBN 9781492054276

Published in 2020

328 pages

O'Reilly Media

Snowflake: The Definitive Guide

Architecting, Designing, and Deploying on the Snowflake Data Cloud

by Joyce Kay Avila

Snowflake's ability to eliminate data silos and run workloads from a single platform creates opportunities to democratize data analytics, allowing users at all levels within an organization to make data-driven decisions. Whether you're an IT professional working in data warehousing or data science, a business analyst or technical manager, or an aspiring data professional wanting to get more hands-on experience with the Snowflake platform, this book is for you.

You'll learn how Snowflake users can build modern integrated data applications and develop new revenue streams based on data. Using hands-on SQL examples, you'll also discover how the Snowflake Data Cloud helps you accelerate data science by avoiding replatforming or migrating data unnecessarily.

You'll be able to:

Efficiently capture, store, and process large amounts of data at an amazing speed
Ingest and transform real-time data feeds in both structured and semistructured formats and deliver meaningful data insights within minutes
Use Snowflake Time Travel and zero-copy cloning to produce a sensible data recovery strategy that balances system resilience with ongoing storage costs
Securely share data and reduce or eliminate data integration costs by accessing ready-to-query datasets available in the Snowflake Marketplace

About the book

3.63/5 on Goodreads

ISBN 9781098103828

Published in 2022

465 pages

O'Reilly Media

Spark GraphX in Action

by Michael S. Malak and Robin East

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

About the book

3.56/5 on Goodreads

ISBN 9781617292521

Published in 2016

280 pages

Manning Publications

Spark in Action, 2nd Edition

Covers Apache Spark 3 with Examples in Java, Python, and Scala

by Jean-Georges Perrin

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

About the book

3.96/5 on Goodreads

ISBN 9781617295522

Published in 2020

576 pages

Manning Publications

Spark: The Definitive Guide

Big Data Processing Made Simple

by Bill Chambers and Matei Zaharia

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.

You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.

Get a gentle overview of big data and Spark
Learn about DataFrames, SQL, and Datasets (Spark's core APIs) through worked examples
Dive into Spark's low-level APIs, RDDs, and execution of SQL and DataFrames
Understand how Spark runs on a cluster
Debug, monitor, and tune Spark clusters and applications
Learn the power of Structured Streaming, Spark's stream-processing engine
Learn how you can apply MLlib to a variety of problems, including classification or recommendation

About the book

4.14/5 on Goodreads

ISBN 9781491912218

Published in 2018

603 pages

O'Reilly Media

SQL for Data Analysis

Advanced Techniques for Transforming Data into Insights

by Cathy Tanimura

With the explosion of data, computing power, and cloud data warehouses, SQL has become an even more indispensable tool for the savvy analyst or data scientist. This practical book reveals new and hidden ways to improve your SQL skills, solve problems, and make the most of SQL as part of your workflow.

You'll learn how to use both common and exotic SQL functions such as joins, window functions, subqueries, and regular expressions in new, innovative ways, as well as how to combine SQL techniques to accomplish your goals faster, with understandable code. If you work with SQL databases, this is a must-have reference.

Learn the key steps for preparing your data for analysis
Perform time series analysis using SQL's date and time manipulations
Use cohort analysis to investigate how groups change over time
Use SQL's powerful functions and operators for text analysis
Detect outliers in your data and replace them with alternate values
Establish causality using experiment analysis, also known as A/B testing

About the book

4.65/5 on Goodreads

ISBN 9781492088783

Published in 2021

357 pages

O'Reilly Media

SQL Server DMVs in Action

Better Queries with Dynamic Management Views

by Ian W. Stirk

SQL Server DMVs in Action is a practical guide that shows you how to obtain, interpret, and act on the information captured by DMVs to keep your system in top shape. The samples provided in this book will help you master DMVs and also give you a tested, working, and instantly reusable SQL code library.

About the book

3.91/5 on Goodreads

ISBN 9781935182733

Published in 2011

352 pages

Manning Publications

SQL Server MVP Deep Dives, Volume 2

by Kalen Delaney, Louis Davidson, Greg Low, Brad McGehee, Paul Nielsen, Paul Randal and Kimberly Tripp

SQL Server MVP Deep Dives, Volume 2 lets you learn from the best in the business—64 SQL Server MVPs offer completely new content in this second volume on topics ranging from testing and policy management to integration services, reporting, and performance optimization techniques...and more.

About the book

4.42/5 on Goodreads

ISBN 9781617290473

Published in 2011

688 pages

Manning Publications

Statistical Rethinking, 2nd Edition

A Bayesian Course with Examples in R and Stan

by Richard McElreath

Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Second Edition builds your knowledge of and confidence in statistical modeling. It pushes readers to perform step-by-step calculations that are usually automated, reflecting its unique computational approach.

About the book

4.72/5 on Goodreads

ISBN 9780367139919

Published in 2020

612 pages

CRC Press

Statistics Done Wrong

The Woefully Complete Guide

by Alex Reinhart

Scientific progress depends on good research, and good research needs good statistics. But statistical analysis is tricky to get right, even for the best and brightest of us. You'd be surprised how many scientists are doing it wrong.

Statistics Done Wrong is a pithy, essential guide to statistical blunders in modern science that will show you how to keep your research blunder-free. You'll examine embarrassing errors and omissions in recent research, learn about the misconceptions and scientific politics that allow these mistakes to happen, and begin your quest to reform the way you and your peers do statistics.

You'll find advice on:

Asking the right question, designing the right experiment, choosing the right statistical analysis, and sticking to the plan
How to think about p values, significance, insignificance, confidence intervals, and regression
Choosing the right sample size and avoiding false positives
Reporting your analysis and publishing your data and source code
Procedures to follow, precautions to take, and analytical software that can help

The first step toward statistics done right is Statistics Done Wrong.

About the book

4.18/5 on Goodreads

ISBN 9781593276201

Published in 2015

176 pages

No Starch Press

Statistics Slam Dunk

Statistical analysis with R on real NBA data

by Gary Sutton

Learn statistics by analyzing professional basketball data! In this action-packed book, you’ll build your skills in exploratory data analysis by digging into the fascinating world of NBA games and player stats using the R language.

Statistics Slam Dunk is an engaging how-to guide for statistical analysis with R. Each chapter contains an end-to-end data science or statistics project delving into NBA data and revealing real-world sporting insights. Written by a former basketball player turned business intelligence and analytics leader, you’ll get practical experience tidying, wrangling, exploring, testing, modeling, and otherwise analyzing data with the best and latest R packages and functions.

In Statistics Slam Dunk you’ll develop a toolbox of R programming skills including:

Reading and writing data
Installing and loading packages
Transforming, tidying, and wrangling data
Applying best-in-class exploratory data analysis techniques
Creating compelling visualizations
Developing supervised and unsupervised machine learning algorithms
Executing hypothesis tests, including t-tests and chi-square tests for independence
Computing expected values, Gini coefficients, z-scores, and other measures

If you’re looking to switch to R from another language, or trade base R for tidyverse functions, this book is the perfect training coach. Much more than a beginner’s guide, it teaches statistics and data science methods that have tons of use cases. And just like in the real world, you’ll get no clean pre-packaged data sets in Statistics Slam Dunk. You’ll take on the challenge of wrangling messy data to drill on the skills that will make you the star player on any data team.

About the book

5/5 on Goodreads

ISBN 9781633438682

Published in 2024

672 pages

Manning Publications

Storytelling with Data

Let’s Practice!

by Cole Nussbaumer Knaflic

This is not a book. It is a one-of-a-kind immersive learning experience through which you can become—or teach others to be—a powerful data storyteller.

Let’s Practice! helps you build confidence and credibility to create graphs and visualizations that make sense and weave them into action-inspiring stories. Expanding upon the foundational lessons of the best seller Storytelling with Data, Let’s Practice! delivers fresh content, a plethora of new examples, and over 100 hands-on exercises. Author and data storytelling maven Cole Nussbaumer Knaflic guides you along the path to hone core skills and become a well-practiced data communicator. Each chapter includes:

Practice with Cole: exercises based on real-world examples first posed for you to consider and solve, followed by detailed step-by-step illustration and explanation
Practice on your own: thought-provoking questions and even more exercises to be assigned or worked through individually, without prescribed solutions
Practice at work: practical guidance and hands-on exercises for applying storytelling with data lessons on the job, including instruction on when and how to solicit useful feedback and refine for greater impact

The lessons and exercises found within this comprehensive guide will empower you to master—or develop in others—data storytelling skills and transition your work from acceptable to exceptional. By investing in these skills for ourselves and our teams, we can all tell inspiring and influential data stories!

About the book

4.47/5 on Goodreads

ISBN 9781119621492

Published in 2019

448 pages

Wiley

Taming Text

How to Find, Organize, and Manipulate It

by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About the book

3.78/5 on Goodreads

ISBN 9781933988382

Published in 2012

320 pages

Manning Publications

The Book of Dash

Build Dashboards with Python and Plotly

by Adam Schroeder, Christian Mayer and Ann Marie Ward

A swift and practical introduction to building interactive data visualization apps in Python, known as dashboards. You’ve seen dashboards before; think election result visualizations you can update in real time, or population maps you can filter by demographic. With the Python Dash library you’ll create analytic dashboards that present data in effective, usable, elegant ways in just a few lines of code.

The book is fast-paced and caters to those entirely new to dashboards. It will talk you through the necessary software, then get straight into building the dashboards themselves. You’ll learn the basic format of a Dash app by building a Twitter analysis dashboard that maps the number of likes certain accounts gained over time. You’ll build up skills through three more sophisticated projects. The first is a global analysis app that compares country data in three areas: the percentage of a population using the internet, percentage of parliament seats held by women, and CO2 emissions. You’ll then build an investment portfolio dashboard, and an app that allows you to visualize and explore machine learning algorithms.

In this book you will:

Create and run your first Dash apps
Use the pandas library to manipulate and analyze social media data
Use Git to download and build on existing apps written by the pros
Visualize machine learning models in your apps
Create and manipulate statistical and scientific charts and maps using Plotly

Dash combines several technologies to get you building dashboards quickly and efficiently. This book will do the same.

About the book

3.67/5 on Goodreads

ISBN 9781718502222

Published in 2022

224 pages

No Starch Press

The Cloud Data Lake

A Guide to Building Robust Cloud Data Architecture

by Rukmani Gopalan

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

Learn the benefits of a cloud-based big data strategy for your organization
Get guidance and best practices for designing performant and scalable data lakes
Examine architecture and design choices, and data governance principles and strategies
Build a data strategy that scales as your organizational and business needs increase
Implement a scalable data lake in the cloud
Use cloud-based advanced analytics to gain more value from your data

About the book

4.33/5 on Goodreads

ISBN 9781098116583

Published in 2022

244 pages

O'Reilly Media

The Data Warehouse Toolkit, 3rd Edition

The Definitive Guide to Dimensional Modeling

by Ralph Kimball and Margy Ross

The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space. This new third edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more.

Authored by Ralph Kimball and Margy Ross, known worldwide as educators, consultants, and influential thought leaders in data warehousing and business intelligence
Begins with fundamental design recommendations and progresses through increasingly complex scenarios
Presents unique modeling techniques for business applications such as inventory management, procurement, invoicing, accounting, customer relationship management, big data analytics, and more
Draws real-world case studies from a variety of industries, including retail sales, financial services, telecommunications, education, health care, insurance, e-commerce, and more

Design dimensional databases that are easy to understand and provide fast query response with The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.

About the book

4.17/5 on Goodreads

ISBN 9781118530801

Published in 2013

600 pages

Wiley

Tika in Action

by Chris A. Mattmann and Jukka L. Zitting

Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.

About the book

3.88/5 on Goodreads

ISBN 9781935182856

Published in 2011

256 pages

Manning Publications

Time Series Forecasting in Python

by Marco Peixeiro

Build predictive models from time-based patterns in your data. Master statistical models including new deep learning approaches for time series forecasting.

In Time Series Forecasting in Python you will learn how to:

Recognize a time series forecasting problem and build a performant predictive model
Create univariate forecasting models that account for seasonal effects and external variables
Build multivariate forecasting models to predict many time series at once
Leverage large datasets by using deep learning for forecasting time series
Automate the forecasting process

Time Series Forecasting in Python teaches you to build powerful predictive models from time-based data. Every model you create is relevant, useful, and easy to implement with Python. You’ll explore interesting real-world datasets like Google’s daily stock price and economic data for the USA, quickly progressing from the basics to developing large-scale models that use deep learning tools like TensorFlow.

About the book

4.29/5 on Goodreads

ISBN 9781617299889

Published in 2022

456 pages

Manning Publications

Transfer Learning for Natural Language Processing

by Paul Azunre

Build custom NLP models in record time by adapting pre-trained machine learning models to solve specialized problems.

In Transfer Learning for Natural Language Processing you will learn:

Fine tuning pretrained models with new domain data
Picking the right model to reduce resource usage
Transfer learning for neural network architectures
Generating text with generative pretrained transformers
Cross-lingual transfer learning with BERT
Foundations for exploring NLP academic literature

Training deep learning NLP models from scratch is costly, time-consuming, and requires massive amounts of data. In Transfer Learning for Natural Language Processing, DARPA researcher Paul Azunre reveals cutting-edge transfer learning techniques that apply customizable pretrained models to your own NLP architectures. You’ll learn how to use transfer learning to deliver state-of-the-art results for language comprehension, even when working with limited label data. Best of all, you’ll save on training time and computational costs.

About the book

4/5 on Goodreads

ISBN 9781617297267

Published in 2021

272 pages

Manning Publications

Transformers for Natural Language Processing and Computer Vision, 3rd Edition

Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3

by Denis Rothman

Transformers for Natural Language Processing and Computer Vision, Third Edition, explores Large Language Model (LLM) architectures, applications, and various platforms (Hugging Face, OpenAI, and Google Vertex AI) used for Natural Language Processing (NLP) and Computer Vision (CV).

The book guides you through different transformer architectures to the latest Foundation Models and Generative AI. You’ll pretrain and fine-tune LLMs and work through different use cases, from summarization to implementing question-answering systems with embedding-based search techniques. You will also learn the risks of LLMs, from hallucinations and memorization to privacy, and how to mitigate such risks using moderation models with rule and knowledge bases. You’ll implement Retrieval Augmented Generation (RAG) with LLMs to improve the accuracy of your models and gain greater control over LLM outputs.

Dive into generative vision transformers and multimodal model architectures and build applications, such as image and video-to-text classifiers. Go further by combining different models and platforms and learning about AI agent replication.

This book provides you with an understanding of transformer architectures, pretraining, fine-tuning, LLM use cases, and best practices.

What you will learn

Learn how to pretrain and fine-tune LLMs
Learn how to work with multiple platforms, such as Hugging Face, OpenAI, and Google Vertex AI
Learn about different tokenizers and the best practices for preprocessing language data
Implement Retrieval Augmented Generation and rules bases to mitigate hallucinations
Visualize transformer model activity for deeper insights using BertViz, LIME, and SHAP
Create and implement cross-platform chained models, such as HuggingGPT
Go in-depth into vision transformers with CLIP, DALL-E 2, DALL-E 3, and GPT-4V

Who this book is for

This book is ideal for NLP and CV engineers, software developers, data scientists, machine learning engineers, and technical leaders looking to advance their LLM and generative AI skills or explore the latest trends in the field. Knowledge of Python and machine learning concepts is required to fully understand the use cases and code examples. However, with examples using LLM user interfaces, prompt engineering, and no-code model building, this book is great for anyone curious about the AI revolution.

About the book

0/5 on Goodreads

ISBN 9781805128724

Published in 2024

728 pages

Packt Publishing

Trino: The Definitive Guide, 2nd Edition

SQL at Any Scale, on Any Storage, in Any Environment

by Matt Fuller, Manfred Moser and Martin Traverso

Perform fast interactive analytics against different data sources using the Trino high-performance distributed SQL query engine. In the second edition of this practical guide, you'll learn how to conduct analytics on data where it lives, whether it's a data lake using Hive, a modern lakehouse with Iceberg or Delta Lake, a different system like Cassandra, Kafka, or SingleStore, or a relational database like PostgreSQL or Oracle.

Analysts, software engineers, and production engineers learn how to manage, use, and even develop with Trino and make it a critical part of their data platform. Authors Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Trino query can combine data from multiple sources to allow for analytics across your entire organization.

Explore Trino's use cases, and learn about tools that help you connect to Trino for querying and processing huge amounts of data
Learn Trino's internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more
Deploy and secure Trino at scale, monitor workloads, tune queries, and connect more applications
Learn how other organizations apply Trino successfully

About the book

3.5/5 on Goodreads

ISBN 9781098137236

Published in 2022

319 pages

O'Reilly Media

Visualizing Graph Data

by Corey L. Lanum

Visualizing Graph Data teaches you not only how to build graph data structures, but also how to create your own dynamic and interactive visualizations using a variety of tools. This book is loaded with fascinating examples and case studies to show you the real-world value of graph visualizations.

About the book

3.75/5 on Goodreads

ISBN 9781617293078

Published in 2016

232 pages

Manning Publications

Web Scraping with Python, 3rd Edition

Data Extraction from the Modern Web

by Ryan Mitchell

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

Parse complicated HTML pages
Develop crawlers with the Scrapy framework
Learn methods to store the data you scrape
Read and extract data from documents
Clean and normalize badly formatted data
Read and write natural languages
Crawl through forms and logins
Scrape JavaScript and crawl through APIs
Use and write image-to-text software
Avoid scraping traps and bot blockers
Use scrapers to test your website

About the book

0/5 on Goodreads

ISBN 9781098145354

Published in 2024

352 pages

O'Reilly Media

Webbots, Spiders, and Screen Scrapers, 2nd Edition

A Guide to Developing Internet Agents with PHP/CURL

by Michael Schrenk

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time-consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

Send email or SMS notifications to alert you to new information quickly
Search different data sources and combine the results on one page, making the data easier to interpret and analyze
Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.

About the book

3.75/5 on Goodreads

ISBN 9781593273972

Published in 2012

392 pages

No Starch Press