Data Science

125 books, 6 subcategories
Avoiding Pitfalls and Breaking Dilemmas
by Panos Alexopoulos

What value does semantic data modeling offer? As an information architect or data science professional, let’s say you have an abundance of the right data and the technology to extract business gold—but you still fail. The reason? Bad data semantics.

In this practical and comprehensive field guide, author Panos Alexopoulos takes you on an eye-opening journey through semantic data modeling as applied in the real world. You’ll learn how to master this craft to increase the usability and value of your data and applications. You’ll also explore the pitfalls to avoid and dilemmas to overcome for building high-quality and valuable semantic representations of data.

  • Understand the fundamental concepts, phenomena, and processes related to semantic data modeling
  • Examine the quirks and challenges of semantic data modeling and learn how to effectively leverage the available frameworks and tools
  • Avoid mistakes and bad practices that can undermine your efforts to create good data models
  • Learn about model development dilemmas, including representation, expressiveness and content, development, and governance
  • Organize and execute semantic data initiatives in your organization, tackling technical, strategic, and organizational challenges

Architecting, Designing, and Deploying on the Snowflake Data Cloud
by Joyce Kay Avila

Snowflake's ability to eliminate data silos and run workloads from a single platform creates opportunities to democratize data analytics, allowing users at all levels within an organization to make data-driven decisions. Whether you're an IT professional working in data warehousing or data science, a business analyst or technical manager, or an aspiring data professional wanting to get more hands-on experience with the Snowflake platform, this book is for you.

You'll learn how Snowflake users can build modern integrated data applications and develop new revenue streams based on data. Using hands-on SQL examples, you'll also discover how the Snowflake Data Cloud helps you accelerate data science by avoiding replatforming or migrating data unnecessarily.

You'll be able to:

  • Efficiently capture, store, and process large amounts of data at an amazing speed
  • Ingest and transform real-time data feeds in both structured and semistructured formats and deliver meaningful data insights within minutes
  • Use Snowflake Time Travel and zero-copy cloning to produce a sensible data recovery strategy that balances system resilience with ongoing storage costs
  • Securely share data and reduce or eliminate data integration costs by accessing ready-to-query datasets available in the Snowflake Marketplace

by Michael S. Malak and Robin East

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

Covers Apache Spark 3 with Examples in Java, Python, and Scala
by Jean-Georges Perrin

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

Big Data Processing Made Simple
by Bill Chambers and Matei Zaharia

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.

You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.

  • Get a gentle overview of big data and Spark
  • Learn about DataFrames, SQL, and Datasets (Spark’s core APIs) through worked examples
  • Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
  • Understand how Spark runs on a cluster
  • Debug, monitor, and tune Spark clusters and applications
  • Learn the power of Structured Streaming, Spark’s stream-processing engine
  • Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Advanced Techniques for Transforming Data into Insights
by Cathy Tanimura

With the explosion of data, computing power, and cloud data warehouses, SQL has become an even more indispensable tool for the savvy analyst or data scientist. This practical book reveals new and hidden ways to improve your SQL skills, solve problems, and make the most of SQL as part of your workflow.

You'll learn how to use both common and exotic SQL features such as joins, window functions, subqueries, and regular expressions in new, innovative ways, as well as how to combine SQL techniques to accomplish your goals faster, with understandable code. If you work with SQL databases, this is a must-have reference.

  • Learn the key steps for preparing your data for analysis
  • Perform time series analysis using SQL's date and time manipulations
  • Use cohort analysis to investigate how groups change over time
  • Use SQL's powerful functions and operators for text analysis
  • Detect outliers in your data and replace them with alternate values
  • Establish causality using experiment analysis, also known as A/B testing

Better Queries with Dynamic Management Views
by Ian W. Stirk

SQL Server DMVs in Action is a practical guide that shows you how to obtain, interpret, and act on the information captured by DMVs to keep your system in top shape. The samples provided in this book will help you master DMVs and also give you a tested, working, and instantly reusable SQL code library.

by Kalen Delaney, Louis Davidson, Greg Low, Brad McGehee, Paul Nielsen, Paul Randal and Kimberly Tripp

SQL Server MVP Deep Dives, Volume 2 lets you learn from the best in the business—64 SQL Server MVPs offer completely new content in this second volume on topics ranging from testing and policy management to integration services, reporting, and performance optimization techniques...and more.

A Bayesian Course with Examples in R and Stan
by Richard McElreath

Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Second Edition builds your knowledge of and confidence in statistical modeling. It pushes readers to perform step-by-step calculations that are usually automated, reflecting its unique computational approach.

The Woefully Complete Guide
by Alex Reinhart

Scientific progress depends on good research, and good research needs good statistics. But statistical analysis is tricky to get right, even for the best and brightest of us. You'd be surprised how many scientists are doing it wrong.

Statistics Done Wrong is a pithy, essential guide to statistical blunders in modern science that will show you how to keep your research blunder-free. You'll examine embarrassing errors and omissions in recent research, learn about the misconceptions and scientific politics that allow these mistakes to happen, and begin your quest to reform the way you and your peers do statistics.

You'll find advice on:

  • Asking the right question, designing the right experiment, choosing the right statistical analysis, and sticking to the plan
  • How to think about p values, significance, insignificance, confidence intervals, and regression
  • Choosing the right sample size and avoiding false positives
  • Reporting your analysis and publishing your data and source code
  • Procedures to follow, precautions to take, and analytical software that can help

The first step toward statistics done right is Statistics Done Wrong.

Statistical analysis with R on real NBA data
by Gary Sutton

Learn statistics by analyzing professional basketball data! In this action-packed book, you’ll build your skills in exploratory data analysis by digging into the fascinating world of NBA games and player stats using the R language.

Statistics Slam Dunk is an engaging how-to guide for statistical analysis with R. Each chapter contains an end-to-end data science or statistics project delving into NBA data and revealing real-world sporting insights. Written by a former basketball player turned business intelligence and analytics leader, it gives you practical experience tidying, wrangling, exploring, testing, modeling, and otherwise analyzing data with the best and latest R packages and functions.

In Statistics Slam Dunk you’ll develop a toolbox of R programming skills including:

  • Reading and writing data
  • Installing and loading packages
  • Transforming, tidying, and wrangling data
  • Applying best-in-class exploratory data analysis techniques
  • Creating compelling visualizations
  • Developing supervised and unsupervised machine learning algorithms
  • Executing hypothesis tests, including t-tests and chi-square tests for independence
  • Computing expected values, Gini coefficients, z-scores, and other measures

If you’re looking to switch to R from another language, or trade base R for tidyverse functions, this book is the perfect training coach. Much more than a beginner’s guide, it teaches statistics and data science methods that have tons of use cases. And just like in the real world, you’ll get no clean pre-packaged data sets in Statistics Slam Dunk. You’ll take on the challenge of wrangling messy data to drill on the skills that will make you the star player on any data team.

by Cole Nussbaumer Knaflic

This is not a book. It is a one-of-a-kind immersive learning experience through which you can become—or teach others to be—a powerful data storyteller.

Let’s practice! helps you build confidence and credibility to create graphs and visualizations that make sense and weave them into action-inspiring stories. Expanding upon best seller storytelling with data’s foundational lessons, Let’s practice! delivers fresh content, a plethora of new examples, and over 100 hands-on exercises. Author and data storytelling maven Cole Nussbaumer Knaflic guides you along the path to hone core skills and become a well-practiced data communicator. Each chapter includes:

  • Practice with Cole: exercises based on real-world examples first posed for you to consider and solve, followed by detailed step-by-step illustration and explanation
  • Practice on your own: thought-provoking questions and even more exercises to be assigned or worked through individually, without prescribed solutions
  • Practice at work: practical guidance and hands-on exercises for applying storytelling with data lessons on the job, including instruction on when and how to solicit useful feedback and refine for greater impact

The lessons and exercises found within this comprehensive guide will empower you to master—or develop in others—data storytelling skills and transition your work from acceptable to exceptional. By investing in these skills for ourselves and our teams, we can all tell inspiring and influential data stories!

How to Find, Organize, and Manipulate It
by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

Build Dashboards with Python and Plotly
by Adam Schroeder, Christian Mayer and Ann Marie Ward

A swift and practical introduction to building interactive data visualization apps in Python, known as dashboards. You’ve seen dashboards before; think election result visualizations you can update in real time, or population maps you can filter by demographic. With the Python Dash library you’ll create analytic dashboards that present data in effective, usable, elegant ways in just a few lines of code.

The book is fast-paced and caters to those entirely new to dashboards. It will talk you through the necessary software, then get straight into building the dashboards themselves. You’ll learn the basic format of a Dash app by building a Twitter analysis dashboard that maps the number of likes certain accounts gained over time. You’ll build up skills through three more sophisticated projects. The first is a global analysis app that compares country data in three areas: the percentage of a population using the internet, percentage of parliament seats held by women, and CO2 emissions. You’ll then build an investment portfolio dashboard, and an app that allows you to visualize and explore machine learning algorithms.

In this book you will:

  • Create and run your first Dash apps
  • Use the pandas library to manipulate and analyze social media data
  • Use Git to download and build on existing apps written by the pros
  • Visualize machine learning models in your apps
  • Create and manipulate statistical and scientific charts and maps using Plotly

Dash combines several technologies to get you building dashboards quickly and efficiently. This book will do the same.

A Guide to Building Robust Cloud Data Architecture
by Rukmani Gopalan

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

  • Learn the benefits of a cloud-based big data strategy for your organization
  • Get guidance and best practices for designing performant and scalable data lakes
  • Examine architecture and design choices, and data governance principles and strategies
  • Build a data strategy that scales as your organizational and business needs increase
  • Implement a scalable data lake in the cloud
  • Use cloud-based advanced analytics to gain more value from your data

The Definitive Guide to Dimensional Modeling
by Ralph Kimball and Margy Ross

The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space. This new third edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more.

  • Authored by Ralph Kimball and Margy Ross, known worldwide as educators, consultants, and influential thought leaders in data warehousing and business intelligence
  • Begins with fundamental design recommendations and progresses through increasingly complex scenarios
  • Presents unique modeling techniques for business applications such as inventory management, procurement, invoicing, accounting, customer relationship management, big data analytics, and more
  • Draws real-world case studies from a variety of industries, including retail sales, financial services, telecommunications, education, health care, insurance, e-commerce, and more

Design dimensional databases that are easy to understand and provide fast query response with The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.

by Chris A. Mattmann and Jukka L. Zitting

Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.

by Marco Peixeiro

Build predictive models from time-based patterns in your data. Master statistical models including new deep learning approaches for time series forecasting.

In Time Series Forecasting in Python you will learn how to:

  • Recognize a time series forecasting problem and build a performant predictive model
  • Create univariate forecasting models that account for seasonal effects and external variables
  • Build multivariate forecasting models to predict many time series at once
  • Leverage large datasets by using deep learning for forecasting time series
  • Automate the forecasting process

Time Series Forecasting in Python teaches you to build powerful predictive models from time-based data. Every model you create is relevant, useful, and easy to implement with Python. You’ll explore interesting real-world datasets like Google’s daily stock price and economic data for the USA, quickly progressing from the basics to developing large-scale models that use deep learning tools like TensorFlow.

by Paul Azunre

Build custom NLP models in record time by adapting pre-trained machine learning models to solve specialized problems.

In Transfer Learning for Natural Language Processing you will learn:

  • Fine-tuning pretrained models with new domain data
  • Picking the right model to reduce resource usage
  • Transfer learning for neural network architectures
  • Generating text with generative pretrained transformers
  • Cross-lingual transfer learning with BERT
  • Foundations for exploring NLP academic literature

Training deep learning NLP models from scratch is costly, time-consuming, and requires massive amounts of data. In Transfer Learning for Natural Language Processing, DARPA researcher Paul Azunre reveals cutting-edge transfer learning techniques that apply customizable pretrained models to your own NLP architectures. You’ll learn how to use transfer learning to deliver state-of-the-art results for language comprehension, even when working with limited label data. Best of all, you’ll save on training time and computational costs.

Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3
by Denis Rothman

Transformers for Natural Language Processing and Computer Vision, Third Edition, explores Large Language Model (LLM) architectures, applications, and various platforms (Hugging Face, OpenAI, and Google Vertex AI) used for Natural Language Processing (NLP) and Computer Vision (CV).

The book guides you through different transformer architectures to the latest Foundation Models and Generative AI. You’ll pretrain and fine-tune LLMs and work through different use cases, from summarization to implementing question-answering systems with embedding-based search techniques. You will also learn the risks of LLMs, from hallucinations and memorization to privacy, and how to mitigate such risks using moderation models with rule and knowledge bases. You’ll implement Retrieval Augmented Generation (RAG) with LLMs to improve the accuracy of your models and gain greater control over LLM outputs.

Dive into generative vision transformers and multimodal model architectures and build applications, such as image and video-to-text classifiers. Go further by combining different models and platforms and learning about AI agent replication.

This book provides you with an understanding of transformer architectures, pretraining, fine-tuning, LLM use cases, and best practices.

What you will learn

  • Learn how to pretrain and fine-tune LLMs
  • Learn how to work with multiple platforms, such as Hugging Face, OpenAI, and Google Vertex AI
  • Learn about different tokenizers and the best practices for preprocessing language data
  • Implement Retrieval Augmented Generation and rules bases to mitigate hallucinations
  • Visualize transformer model activity for deeper insights using BertViz, LIME, and SHAP
  • Create and implement cross-platform chained models, such as HuggingGPT
  • Go in-depth into vision transformers with CLIP, DALL-E 2, DALL-E 3, and GPT-4V

Who this book is for

This book is ideal for NLP and CV engineers, software developers, data scientists, machine learning engineers, and technical leaders looking to advance their LLMs and generative AI skills or explore the latest trends in the field. Knowledge of Python and machine learning concepts is required to fully understand the use cases and code examples. However, with examples using LLM user interfaces, prompt engineering, and no-code model building, this book is great for anyone curious about the AI revolution.

SQL at Any Scale, on Any Storage, in Any Environment
by Matt Fuller, Manfred Moser and Martin Traverso

Perform fast interactive analytics against different data sources using the Trino high-performance distributed SQL query engine. In the second edition of this practical guide, you'll learn how to conduct analytics on data where it lives, whether it's a data lake using Hive, a modern lakehouse with Iceberg or Delta Lake, a different system like Cassandra, Kafka, or SingleStore, or a relational database like PostgreSQL or Oracle.

Analysts, software engineers, and production engineers learn how to manage, use, and even develop with Trino and make it a critical part of their data platform. Authors Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Trino query can combine data from multiple sources to allow for analytics across your entire organization.

  • Explore Trino's use cases, and learn about tools that help you connect to Trino for querying and processing huge amounts of data
  • Learn Trino's internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more
  • Deploy and secure Trino at scale, monitor workloads, tune queries, and connect more applications
  • Learn how other organizations apply Trino successfully

by Corey L. Lanum

Visualizing Graph Data teaches you not only how to build graph data structures, but also how to create your own dynamic and interactive visualizations using a variety of tools. This book is loaded with fascinating examples and case studies to show you the real-world value of graph visualizations.

Data Extraction from the Modern Web
by Ryan Mitchell

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

  • Parse complicated HTML pages
  • Develop crawlers with the Scrapy framework
  • Learn methods to store the data you scrape
  • Read and extract data from documents
  • Clean and normalize badly formatted data
  • Read and write natural languages
  • Crawl through forms and logins
  • Scrape JavaScript and crawl through APIs
  • Use and write image-to-text software
  • Avoid scraping traps and bot blockers
  • Use scrapers to test your website

A Guide to Developing Internet Agents with PHP/CURL
by Michael Schrenk

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

  • Send email or SMS notifications to alert you to new information quickly
  • Search different data sources and combine the results on one page, making the data easier to interpret and analyze
  • Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.