/dev/reading
Category: Data Science

125 books, 6 subcategories
by Dzejla Medjedovic, Emin Tahirovic and Ines Dedovic

Massive modern datasets make traditional data structures and algorithms grind to a halt. This fun and practical guide introduces cutting-edge techniques that can reliably handle even the largest distributed datasets.

In Algorithms and Data Structures for Massive Datasets you will learn:

  • Probabilistic sketching data structures for practical problems
  • Choosing the right database engine for your application
  • Evaluating and designing efficient on-disk data structures and algorithms
  • Understanding the algorithmic trade-offs involved in massive-scale systems
  • Deriving basic statistics from streaming data
  • Correctly sampling streaming data
  • Computing percentiles with limited space resources

Algorithms and Data Structures for Massive Datasets reveals a toolbox of new methods that are perfect for handling modern big data applications. You’ll explore the novel data structures and algorithms that underpin Google, Facebook, and other enterprise applications that work with truly massive amounts of data. These effective techniques can be applied to any discipline, from finance to text analysis. Graphics, illustrations, and hands-on industry examples make complex ideas practical to implement in your projects, and there are no mathematical proofs to puzzle over. Work through this one-of-a-kind guide, and you’ll find the sweet spot of saving space without sacrificing your data’s accuracy.
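
To give a flavor of one topic mentioned above, here is a minimal reservoir-sampling sketch in Python: it keeps a uniform random sample of a stream without ever storing the whole stream. It illustrates the general technique only and is not code from the book.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item         # replace with probability k / (i + 1)
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```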

by Douglas G. McIlwraith, Haralambos Marmanis and Dmitry Babenko

Algorithms of the Intelligent Web, Second Edition teaches the most important approaches to algorithmic web data analysis, enabling you to create your own machine learning applications that crunch, munge, and wrangle data collected from users, web applications, sensors and website logs.

Building Meaningful Data Models at Scale
by Rui Pedro Machado and Helder Russa

With the shift from data warehouses to data lakes, data now lands in repositories before it's been transformed, enabling engineers to model raw data into clean, well-defined datasets. dbt (data build tool) helps you take data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL.

Authors Rui Machado from Monstarlab and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering aspects. If you know your business well and have the technical skills to model raw data into clean, well-defined datasets, you'll learn how to design and deliver data models without any technical influence.

With this book, you'll learn:

  • What dbt is and how a dbt project is structured
  • How dbt fits into the data engineering and analytics worlds
  • How to collaborate on building data models
  • The main tools and architectures for building useful, functional data models
  • How to fit dbt into data warehousing and data lake architectures
  • How to build tests for data transformations

Enable Analytics and AI-Driven Innovation in the Cloud
by Marco Tranquillin, Valliappa Lakshmanan and Firat Tekiner

All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence in a fast and efficient way. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle from ingestion to activation in a cloud environment using real-world enterprise architectures. You'll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you'll be able to leverage recent AI/ML patterns to get accurate and quicker insights to drive competitive advantage.

You'll learn how to:

  • Design a modern and secure cloud native or hybrid data analytics and machine learning platform
  • Accelerate data-led innovation by consolidating enterprise data in a governed, scalable, and resilient data platform
  • Democratize access to enterprise data and govern how business teams extract insights and build AI/ML capabilities
  • Enable your business to make decisions in real time using streaming pipelines
  • Build an MLOps platform to move to a predictive and prescriptive analytics approach

A Tour of Statistical Software Design
by Norman Matloff

R is the world's most popular language for developing statistical software: Archaeologists use it to track the spread of ancient civilizations, drug companies use it to discover which medications are safe and effective, and actuaries use it to assess financial risks and keep economies running smoothly.

The Art of R Programming takes you on a guided tour of software development with R, from basic types and data structures to advanced topics like closures, recursion, and anonymous functions. No statistical knowledge is required, and your programming skills can range from hobbyist to pro.

Along the way, you'll learn about functional and object-oriented programming, running mathematical simulations, and rearranging complex data into simpler, more useful formats. You'll also learn to:

  • Create artful graphs to visualize complex data sets and functions
  • Write more efficient code using parallel R and vectorization
  • Interface R with C/C++ and Python for increased speed or functionality
  • Find new R packages for text analysis, image manipulation, and more
  • Squash annoying bugs with advanced debugging techniques

Whether you're designing aircraft, forecasting the weather, or you just need to tame your data, The Art of R Programming is your guide to harnessing the power of statistical computing.

A practical guide to probabilistic modelling
by Osvaldo Martin

The third edition of Bayesian Analysis with Python serves as an introduction to the main concepts of applied Bayesian modeling using PyMC, a state-of-the-art probabilistic programming library, and other libraries that support and facilitate modeling like ArviZ, for exploratory analysis of Bayesian models; Bambi, for flexible and easy hierarchical linear modeling; PreliZ, for prior elicitation; PyMC-BART, for flexible non-parametric regression; and Kulprit, for variable selection.

In this updated edition, a brief and conceptual introduction to probability theory enhances your learning journey by introducing new topics like Bayesian additive regression trees (BART), featuring updated examples. Refined explanations, informed by feedback and experience from previous editions, underscore the book's emphasis on Bayesian statistics. You will explore various models, including hierarchical models, generalized linear models for regression and classification, mixture models, Gaussian processes, and BART, using synthetic and real datasets.

By the end of this book, you will possess a functional understanding of probabilistic modeling, enabling you to design and implement Bayesian models for your data science challenges. You'll be well-prepared to delve into more advanced material or specialized statistical modeling if the need arises.

What you will learn

  • Build probabilistic models using PyMC and Bambi
  • Analyze and interpret probabilistic models with ArviZ
  • Acquire the skills to sanity-check models and modify them if necessary
  • Build better models with prior and posterior predictive checks
  • Learn the advantages and caveats of hierarchical models
  • Compare models and choose between alternative ones
  • Interpret results and apply your knowledge to real-world problems
  • Explore common models from a unified probabilistic perspective
  • Apply the Bayesian framework's flexibility for probabilistic thinking
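
As a small taste of the workflow the book teaches, the sketch below fits a beta-Bernoulli model with PyMC and summarizes it with ArviZ. The data are made up, and the book's own examples are more elaborate.

```python
import arviz as az
import numpy as np
import pymc as pm

# Hypothetical data: 100 coin flips, 62 of them heads.
data = np.repeat([1, 0], [62, 38])

with pm.Model():
    theta = pm.Beta("theta", alpha=1, beta=1)            # prior on the success rate
    pm.Bernoulli("y", p=theta, observed=data)            # likelihood
    idata = pm.sample(1000, tune=1000, random_seed=42)   # draw posterior samples

print(az.summary(idata, var_names=["theta"]))
```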

Who this book is for

If you are a student, data scientist, researcher, or developer looking to get started with Bayesian data analysis and probabilistic programming, this book is for you. The book is introductory, so no previous statistical knowledge is required, although some experience in using Python and scientific libraries like NumPy is expected.

Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks
by Will Kurt

Get the most from your data, and have fun doing it

Probability and statistics are increasingly important in a huge range of professions. But many people use data in ways they don’t even understand, meaning they aren’t getting the most from it. Bayesian Statistics the Fun Way will change that.

This book will give you a complete understanding of Bayesian statistics through simple explanations and un-boring examples. Find out the probability of UFOs landing in your garden, how likely Han Solo is to survive a flight through an asteroid belt, how to win an argument about conspiracy theories, and whether a burglary really was a burglary, to name a few examples.

By using these off-the-beaten-track examples, the author actually makes learning statistics fun. And you’ll learn real skills, like how to:

  • Measure your own level of uncertainty in a conclusion or belief
  • Calculate Bayes’ theorem and understand what it’s useful for
  • Find the posterior, likelihood, and prior to check the accuracy of your conclusions
  • Calculate distributions to see the range of your data
  • Compare hypotheses and draw reliable conclusions from them

Next time you find yourself with a sheaf of survey results and no idea what to do with them, turn to Bayesian Statistics the Fun Way to get the most value from your data.
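
For readers who want to see the core calculation in code, this small Python snippet applies Bayes’ theorem to made-up numbers; it illustrates the idea the book teaches rather than reproducing one of its examples.

```python
# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
# All numbers below are illustrative, not taken from the book.
p_h = 0.01               # prior: how plausible the hypothesis is before seeing data
p_d_given_h = 0.95       # likelihood of the data if the hypothesis is true
p_d_given_not_h = 0.10   # likelihood of the data otherwise

p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)   # total probability of the data
p_h_given_d = p_d_given_h * p_h / p_d                   # posterior

print(f"Posterior P(H|D) = {p_h_given_d:.3f}")          # about 0.088
```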

A beginner's guide to R and RStudio
by Dr. Jonathan Carroll

Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques like loops and conditionals to create your own custom functions. You’ll come away with a toolkit of strategies for analyzing and visualizing data of all sorts using R and RStudio.

Principles and best practices of scalable realtime data systems
by Nathan Marz and James Warren

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

by Emily Robinson and Jacqueline Nolis

You are going to need more than technical knowledge to succeed as a data scientist.

Build a Career in Data Science teaches you what school leaves out, from how to land your first job to the lifecycle of a data science project, and even how to become a manager.

Patterns for Designing & Building Event-Driven Architectures
by Adam Bellemare

The exponential growth of data combined with the need to derive real-time business value is a critical issue today. An event-driven data mesh can power real-time operational and analytical workloads, all from a single set of data product streams. With practical real-world examples, this book shows you how to successfully design and build an event-driven data mesh.

Building an Event-Driven Data Mesh provides:

  • Practical tips for iteratively building your own event-driven data mesh, including hurdles you'll experience, possible solutions, and how to obtain real value as soon as possible
  • Solutions to pitfalls you may encounter when moving your organization from monoliths to event-driven architectures
  • A clear understanding of how events relate to systems and other events in the same stream and across streams
  • A realistic look at event modeling options, such as fact, delta, and command type events, including how these choices will impact your data products
  • Best practices for handling events at scale, privacy, and regulatory compliance
  • Advice on asynchronous communication and handling eventual consistency

Create and deploy enterprise-ready ETL pipelines by employing modern methods
by Brij Kishore Pandey and Emily Ro Schoof

Modern extract, transform, and load (ETL) pipelines for data engineering have favored the Python language for its broad range of uses and a large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing.

In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment to create pipelines. Once you've explored the ETL pipeline design principles and ETL development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process, which involves extracting valuable data; performing transformations, through cleaning, manipulation, and ensuring data integrity; and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons when building data pipelines and leveraging cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about the concept of test-driven development for ETL pipelines to ensure safe deployments.

By the end of this book, you’ll have worked on several hands-on examples to create high-performance ETL pipelines to develop robust, scalable, and resilient environments using Python.

What you will learn

  • Explore the available libraries and tools to create ETL pipelines using Python
  • Write clean and resilient ETL code in Python that can be extended and easily scaled
  • Understand the best practices and design principles for creating ETL pipelines
  • Orchestrate the ETL process and scale the ETL pipeline effectively
  • Discover tools and services available in AWS for ETL pipelines
  • Understand different testing strategies and implement them with the ETL process
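
As a minimal sketch of the extract-transform-load steps described above, the following example uses pandas and SQLite with hypothetical file and column names; the book works through fuller pipelines and cloud tooling.

```python
import sqlite3

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["customer_id"])               # drop rows missing a key field
    df["order_date"] = pd.to_datetime(df["order_date"])  # normalize types
    df["amount"] = df["amount"].clip(lower=0)            # simple integrity rule
    return df

def load(df: pd.DataFrame, db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

# "orders.csv" and its columns are hypothetical.
load(transform(extract("orders.csv")), "warehouse.db")
```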

Who this book is for

If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.

A Practitioner's Guide
by Jesus Barrasa and Jim Webber

Incredibly useful, knowledge graphs help organizations keep track of medical research, cybersecurity threat intelligence, GDPR compliance, web user engagement, and much more. They do so by storing interlinked descriptions of entities (objects, events, situations, or abstract concepts) and encoding the underlying information. How do you create a knowledge graph? And how do you move it from theory into production?

Using hands-on examples, this practical book shows data scientists and data engineers how to build their own knowledge graphs. Authors Jesus Barrasa and Jim Webber from Neo4j illustrate common patterns for building knowledge graphs that solve many of today's pressing knowledge management problems. You'll quickly discover how these graphs become increasingly useful as you add data and augment them with algorithms and machine learning.

  • Learn the organizing principles necessary to build a knowledge graph
  • Explore how graph databases serve as a foundation for knowledge graphs
  • Understand how to import structured and unstructured data into your graph
  • Follow examples to build integration-and-search knowledge graphs
  • Learn what pattern detection knowledge graphs help you accomplish
  • Explore dependency knowledge graphs through examples
  • Use examples of natural language knowledge graphs and chatbots
  • Use graph algorithms and ML to gain insight into connected data

Applying Causal Inference in the Tech Industry
by Matheus Facure

How many buyers will an additional dollar of online marketing bring in? Which customers will only buy when given a discount coupon? How do you establish an optimal pricing strategy? The best way to determine how the levers at our disposal affect the business metrics we want to drive is through causal inference.

In this book, author Matheus Facure, senior data scientist at Nubank, explains the largely untapped potential of causal inference for estimating impacts and effects. Managers, data scientists, and business analysts will learn classical causal inference methods like randomized control trials (A/B tests), linear regression, propensity score, synthetic controls, and difference-in-differences. Each method is accompanied by an application in the industry to serve as a grounding example.

With this book, you will:

  • Learn how to use basic concepts of causal inference
  • Frame a business problem as a causal inference problem
  • Understand how bias gets in the way of causal inference
  • Learn how causal effects can differ from person to person
  • Use repeated observations of the same customers across time to adjust for biases
  • Understand how causal effects differ across geographic locations
  • Examine noncompliance bias and effect dilution

by Satnam Alag

There's a great deal of wisdom in a crowd, but how do you listen to a thousand people talking at once? Identifying the wants, needs, and knowledge of internet users can be like listening to a mob.

In the Web 2.0 era, leveraging the collective power of user contributions, interactions, and feedback is the key to market dominance. A new category of powerful programming techniques lets you discover the patterns, inter-relationships, and individual profiles—the collective intelligence—locked in the data people leave behind as they surf websites, post blogs, and interact with other users.

Collective Intelligence in Action is a hands-on guidebook for implementing collective-intelligence concepts using Java. It is the first Java-based book to emphasize the underlying algorithms and technical implementation of vital data gathering and mining techniques like analyzing trends, discovering relationships, and making predictions. It provides a pragmatic approach to personalization by combining content-based analysis with collaborative approaches.

Land your dream job with the help of resume-building tips, over 100 mock questions, and a unique portfolio
by Kedeisha Bryan and Taamir Ransome

Preparing for a data engineering interview can often get overwhelming due to the abundance of tools and technologies, leaving you struggling to prioritize which ones to focus on. This hands-on guide provides you with the essential foundational and advanced knowledge needed to simplify your learning journey.

The book begins by helping you gain a clear understanding of the nature of data engineering and how it differs from organization to organization. As you progress through the chapters, you’ll receive expert advice, practical tips, and real-world insights on everything from creating a resume and cover letter to networking and negotiating your salary. The chapters also offer refresher training on data engineering essentials, including data modeling, database architecture, ETL processes, data warehousing, cloud computing, big data, and machine learning. As you advance, you’ll gain a holistic view by exploring continuous integration/continuous delivery (CI/CD), data security, and privacy. Finally, the book will help you practice with case studies, mock interviews, and behavioral questions.

By the end of this book, you will have a clear understanding of what is required to succeed in an interview for a data engineering role.

What you will learn

  • Create maintainable and scalable code for unit testing
  • Understand the fundamental concepts of core data engineering tasks
  • Prepare with over 100 behavioral and technical interview questions
  • Discover data engineer archetypes and how they can help you prepare for the interview
  • Apply the essential concepts of Python and SQL in data engineering
  • Build your personal brand to noticeably stand out as a candidate

Who this book is for

If you’re an aspiring data engineer looking for guidance on how to land, prepare for, and excel in data engineering interviews, this book is for you. Familiarity with the fundamentals of data engineering, such as data modeling, cloud warehouses, programming (Python and SQL), building data pipelines, scheduling your workflows (Airflow), and APIs, is a prerequisite.

by Jonathan Rioux

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:

  • Manage your data as it scales across multiple machines
  • Scale up your data programs with full confidence
  • Read and write data to and from a variety of sources and formats
  • Deal with messy data with PySpark’s data manipulation functionality
  • Discover new data sets and perform exploratory data analysis
  • Build automated data pipelines that transform, summarize, and get insights from data
  • Troubleshoot common PySpark errors
  • Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.
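
For a sense of the data-manipulation style the book covers, here is a small PySpark aggregation; the file and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Hypothetical CSV file with "region" and "amount" columns.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

summary = (
    df.where(F.col("amount") > 0)                        # drop refunds and bad rows
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)
summary.show()
```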

by Vlad Riscutia

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure.

In Data Engineering on Azure you will learn how to:

  • Pick the right Azure services for different data scenarios
  • Manage data inventory
  • Implement production quality data modeling, analytics, and machine learning workloads
  • Handle data governance
  • Use DevOps to increase reliability
  • Ingest, store, and distribute data
  • Apply best practices for compliance and access control

Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring an engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning.

by Gareth Eagar

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability.

You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You’ll learn how to ensure strong data governance, and about populating data marts and data warehouses along with how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS.

By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!

What you will learn

  • Seamlessly ingest streaming data with Amazon Kinesis Data Firehose
  • Optimize, denormalize, and join datasets with AWS Glue Studio
  • Use Amazon S3 events to trigger a Lambda process to transform a file
  • Load data into a Redshift data warehouse and run queries with ease
  • Visualize and explore data using Amazon QuickSight
  • Extract sentiment data from a dataset using Amazon Comprehend
  • Build transactional data lakes using Apache Iceberg with Amazon Athena
  • Learn how a data mesh approach can be implemented on AWS
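
One bullet above mentions using S3 events to trigger a Lambda transform. A minimal, hypothetical handler might look like the sketch below; the book's own examples go further.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; reads the new object and logs its size."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    print(json.dumps({"bucket": bucket, "key": key, "bytes": len(body)}))
    # A real transform step would parse `body`, reshape it, and write the result
    # to another bucket, a Glue table, or a warehouse.
```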

Who this book is for

This book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts, while gaining practical experience with common data engineering services on AWS, will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it’s not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.

A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL
by Roberto Zagni

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.

This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work.

By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain idempotent code using dbt materialization

Who this book is for

This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, futureproof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision makers looking to understand the importance of building a data platform and foster a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.

A practical guide to operationalizing scalable data analytics systems on GCP
by Adi Wijaya

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines right from storing and processing data and workflow orchestration to presenting data through visualization dashboards.

Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.

By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What you will learn

  • Load data into BigQuery and materialize its output for downstream consumption
  • Build data pipeline orchestration using Cloud Composer
  • Develop Airflow jobs to orchestrate and automate a data warehouse
  • Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
  • Leverage Pub/Sub for messaging and ingestion for event-driven systems
  • Use Dataflow to perform ETL on streaming data
  • Unlock the power of your data with Data Studio
  • Estimate GCP costs for your end-to-end data solutions
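
As a small illustration of querying BigQuery from Python, the sketch below uses the google-cloud-bigquery client library and a public dataset; it assumes default GCP credentials and is not taken from the book.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# A public dataset; swap in your own table for real work.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```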

Who this book is for

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. Beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing, in general, will help you make the most out of this book.

People, Processes, and Tools to Operationalize Data Trustworthiness
by Evren Eryurek, Uri Gilad, Valliappa Lakshmanan, Anita Kibunguchy-Grant and Jessi Ashdown

As you move data to the cloud, you need to consider a comprehensive approach to data governance, along with well-defined and agreed-upon policies to ensure your organization meets compliance requirements. Data governance incorporates the ways people, processes, and technology work together to ensure data is trustworthy and can be used effectively. This practical guide shows you how to effectively implement and scale data governance throughout your organization.

Chief information, data, and security officers and their teams will learn strategy and tooling to support democratizing data and unlocking its value while enforcing security, privacy, and other governance standards. Through good data governance, you can inspire customer trust, enable your organization to identify business efficiencies, generate more competitive offerings, and improve customer experience. This book shows you how.

You'll learn:

  • Data governance strategies addressing people, processes, and tools
  • Benefits and challenges of a cloud-based data governance approach
  • How data governance is conducted from ingest to preparation and use
  • How to handle the ongoing improvement of data quality
  • Challenges and techniques in governing streaming data
  • Data protection for authentication, security, backup, and monitoring
  • How to build a data culture in your organization

Modern Data Architecture with Data Mesh and Data Fabric
by Piethein Strengholt

As data management continues to evolve rapidly, managing all of your data in a central place, such as a data warehouse, is no longer scalable. Today's world is about quickly turning data into value. This requires a paradigm shift in the way we federate responsibilities, manage data, and make it available to others. With this practical book, you'll learn how to design a next-gen data architecture that takes into account the scale you need for your organization.

Executives, architects and engineers, analytics teams, and compliance and governance staff will learn how to build a next-gen data landscape. Author Piethein Strengholt provides blueprints, principles, observations, best practices, and patterns to get you up to speed.

  • Examine data management trends, including regulatory requirements, privacy concerns, and new developments such as data mesh and data fabric
  • Go deep into building a modern data architecture, including cloud data landing zones, domain-driven design, data product design, and more
  • Explore data governance and data security, master data management, self-service data marketplaces, and the importance of metadata

Delivering Data-Driven Value at Scale
by Zhamak Dehghani

We're at an inflection point in data, where our data management solutions no longer match the complexity of organizations, the proliferation of data sources, and the scope of our aspirations to get value from data with AI and analytics. In this practical book, author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale.

Dehghani guides practitioners, architects, technical leaders, and decision makers on their journey from traditional big data architecture to a distributed and multidimensional approach to analytical data management. Data mesh treats data as a product, considers domains as a primary concern, applies platform thinking to create self-serve data infrastructure, and introduces a federated computational model of data governance.

  • Get a complete introduction to data mesh principles and its constituents
  • Design a data mesh architecture
  • Guide a data mesh strategy and execution
  • Navigate organizational design to a decentralized data ownership model
  • Move beyond traditional data warehouses and lakes to a distributed data mesh

by Jacek Majchrzak, Sven Balnojan, Marian Siwiak and Mariusz Sieraczkiewicz

Revolutionize the way your organization approaches data with a data mesh! This new decentralized architecture outpaces monolithic lakes and warehouses and can work for a company of any size.

In Data Mesh in Action you will learn how to:

  • Implement a data mesh in your organization
  • Turn data into a data product
  • Move from your current data architecture to a data mesh
  • Identify data domains, and decompose an organization into smaller, manageable domains
  • Set up the central governance and local governance levels over data
  • Balance responsibilities between the two levels of governance
  • Establish a platform that allows efficient connection of distributed data products and automated governance

Data Mesh in Action reveals how this groundbreaking architecture looks for both startups and large enterprises. You won’t need any new technology—this book shows you how to start implementing a data mesh with flexible processes and organizational change. You’ll explore both an extended case study and real-world examples. As you go, you’ll be expertly guided through discussions around Socio-Technical Architecture and Domain-Driven Design with the goal of building a sleek data-as-a-product system. Plus, dozens of workshop techniques for both in-person and remote meetings help you onboard colleagues and drive a successful transition.

A practical guide to accelerating Snowflake development using universal data modeling techniques
by Serge Gershkovich

The Snowflake Data Cloud is one of the fastest-growing platforms for data warehousing and application workloads. Snowflake's scalable, cloud-native architecture and expansive set of features and objects enable you to deliver data solutions faster than ever before.

Yet, we must ensure that these solutions are developed using recommended design patterns and accompanied by documentation that’s easily accessible to everyone in the organization.

This book will help you get familiar with simple and practical data modeling frameworks that accelerate agile design and evolve with the project from concept to code. These universal principles have helped guide database design for decades, and this book pairs them with unique Snowflake-native objects and examples like never before – giving you a two-for-one crash course in theory as well as direct application.

By the end of this Snowflake book, you’ll have learned how to leverage Snowflake’s innovative features, such as time travel, zero-copy cloning, and change-data-capture, to create cost-effective, efficient designs through time-tested modeling principles that are easily digestible when coupled with real-world examples.

What you will learn

  • Discover the time-saving benefits and applications of data modeling
  • Learn about Snowflake’s cloud-native architecture and its features
  • Understand and apply modeling techniques using Snowflake objects
  • Master universal modeling concepts and language through Snowflake objects
  • Get comfortable reading and transforming semistructured data
  • Learn directly with pre-built recipes and examples
  • Learn to apply modeling frameworks from Star to Data Vault

Who this book is for

This book is for developers working with SQL who are looking to build a strong foundation in modeling best practices and gain an understanding of where they can be effectively applied to save time and effort. Whether you’re an ace in SQL logic or starting out in database design, this book will equip you with the practical foundations of data modeling to guide you on your data journey with Snowflake. Developers who’ve recently discovered Snowflake will be able to uncover its core features and learn to incorporate them into universal modeling frameworks.

Moving and Processing Data for Analytics
by James Densmore

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack.

You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:

  • What a data pipeline is and how it works
  • How data is moved and processed on modern data infrastructure, including cloud platforms
  • Common tools and products used by data engineers to build pipelines
  • How pipelines support analytics and reporting needs
  • Considerations for pipeline maintenance, testing, and alerting

by Bas P. Harenslak and Julian Rutger de Ruiter

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
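
For readers new to Airflow, a minimal DAG looks roughly like this (Airflow 2.x imports, placeholder task logic); the book builds far more realistic pipelines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _extract():
    print("pull data from a source system")      # placeholder task logic

def _transform():
    print("clean and reshape the extracted data")

with DAG(
    dag_id="example_pipeline",                   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    extract >> transform                         # run extract before transform
```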

A Practitioner's Guide to Building Trustworthy Data Pipelines
by Barr Moses, Lior Gavish and Molly Vorwerck

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to these questions, this book is for you.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

  • Build more trustworthy and reliable data pipelines
  • Write scripts to make data checks and identify broken pipelines with data observability
  • Learn how to set and maintain data SLAs, SLIs, and SLOs
  • Develop and lead data quality initiatives at your company
  • Learn how to treat data services and systems with the diligence of production software
  • Automate data lineage graphs across your data ecosystem
  • Build anomaly detectors for your critical data assets
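
As a tiny, hypothetical example of the kind of checks described above, this pandas helper reports null rate, duplicate keys, and freshness; real data observability tooling goes well beyond this.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame, key: str, timestamp: str, max_lag_hours: int = 24) -> dict:
    """Report simple quality signals: nulls, duplicate keys, and data freshness."""
    now = pd.Timestamp.now(tz="UTC")
    latest = pd.to_datetime(df[timestamp], utc=True).max()
    return {
        "null_rate": float(df[key].isna().mean()),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "hours_since_last_record": (now - latest).total_seconds() / 3600,
        "is_fresh": (now - latest) <= pd.Timedelta(hours=max_lag_hours),
    }

# Hypothetical usage: orders_df has an "order_id" key and an "updated_at" timestamp.
# report = basic_checks(orders_df, key="order_id", timestamp="updated_at")
```
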
Discovering, Analyzing, Visualizing and Presenting Data
by EMC Education Services

Data Science and Big Data Analytics is about harnessing the power of data for new insights. The book covers the breadth of activities, methods, and tools that data scientists use. The content focuses on concepts, principles, and practical applications that are applicable to any industry and technology environment, and the learning is supported and explained with examples that you can replicate using open-source software.

This book will help you:

  • Become a contributor on a data science team
  • Deploy a structured lifecycle approach to data analytics problems
  • Apply appropriate analytic techniques and tools to analyzing big data
  • Learn how to tell a compelling story with data to drive business action
  • Prepare for EMC Proven Professional Data Science Certification

Get started discovering, analyzing, visualizing, and presenting data in a meaningful way today!

Five real-world Python projects
by Leonard Apeltsin

Learn data science with Python by building five real-world projects! Experiment with card game predictions, tracking disease outbreaks, and more, as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

  • Techniques for computing and plotting probabilities
  • Statistical analysis using SciPy
  • How to organize datasets with clustering algorithms
  • How to visualize complex multi-variable datasets
  • How to train a decision tree machine learning algorithm

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable data sets and thoroughly-explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.
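
One of the bulleted topics, training a decision tree, can be sketched in a few lines with scikit-learn (shown here on the classic iris dataset; the book's projects use their own data).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.2f}")
```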

What You Need to Know About Data Mining and Data-Analytic Thinking
by Foster Provost and Tom Fawcett

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. This guide also helps you understand the many data-mining techniques in use today.

Based on an MBA course Provost has taught at New York University over the past ten years, Data Science for Business provides examples of real-world business problems to illustrate these principles. You’ll not only learn how to improve communication between business stakeholders and data scientists, but also how to participate intelligently in your company’s data science projects. You’ll also discover how to think data-analytically, and fully appreciate how data science methods can support business decision-making.

  • Understand how data science fits in your organization—and how you can use it for competitive advantage
  • Treat data as a business asset that requires careful investment if you’re to gain real value
  • Approach business problems data-analytically, using the data-mining process to gather good data in the most appropriate way
  • Learn general concepts for actually extracting knowledge from data
  • Apply data science principles when interviewing data science job candidates

First Principles with Python
by Joel Grus

To really learn data science, you should not only master the tools—data science libraries, frameworks, modules, and toolkits—but also understand the ideas and principles underlying them. Updated for Python 3.6, this second edition of Data Science from Scratch shows you how these tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Packed with new material on deep learning, statistics, and natural language processing, this updated book shows you how to find the gems in today’s messy glut of data.

  • Get a crash course in Python
  • Learn the basics of linear algebra, statistics, and probability—and how and when they’re used in data science
  • Collect, explore, clean, munge, and manipulate data
  • Dive into the fundamentals of machine learning
  • Implement models such as k-nearest neighbors, Naïve Bayes, linear and logistic regression, decision trees, neural networks, and clustering
  • Explore recommender systems, natural language processing, network analysis, MapReduce, and databases
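
In the spirit of the book's from-scratch approach, here is a minimal k-nearest-neighbors classifier using only the standard library; it is an illustration, not the author's code.

```python
import math
from collections import Counter

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs."""
    nearest = sorted(labeled_points, key=lambda pl: distance(pl[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [((0, 0), "blue"), ((1, 1), "blue"), ((5, 5), "red"), ((6, 5), "red")]
print(knn_predict(3, points, (4, 4)))  # "red"
```
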
by Jesse C. Daniel

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!
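
A minimal Dask sketch, with hypothetical file and column names, shows the pandas-like API and lazy execution the book builds on.

```python
import dask.dataframe as dd

# A glob of CSV files; Dask reads them lazily as partitions.
df = dd.read_csv("events-2024-*.csv")

# Familiar pandas-style API, but nothing runs until .compute() is called.
counts = df.groupby("event_type")["user_id"].count()
print(counts.compute())
```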

by Stephen A. Thomas

You’ve got data to communicate. But what kind of visualization do you choose, how do you build your visualizations, and how do you ensure that they're up to the demands of the Web?

In Data Visualization with JavaScript, you’ll learn how to use JavaScript, HTML, and CSS to build practical visualizations for your data. Step-by-step examples walk you through creating, integrating, and debugging different types of visualizations and you'll be building basic visualizations (like bar, line, and scatter graphs) in no time.

You'll also learn how to:

  • Create tree maps, heat maps, network graphs, word clouds, and timelines
  • Map geographic data, and build sparklines and composite charts
  • Add interactivity and retrieve data with AJAX
  • Manage data in the browser and build data-driven web applications
  • Harness the power of the Flotr2, Flot, Chronoline.js, D3.js, Underscore.js, and Backbone.js libraries

If you already know your way around building a web page but aren’t quite sure how to build a good visualization, Data Visualization with JavaScript will help you get your feet wet without throwing you into the deep end. You’ll soon be well on your way to creating simple, powerful data visualizations.

by Ashley Davis

Data Wrangling with JavaScript is a hands-on guide that will teach you how to create a JavaScript-based data processing pipeline, handle common and exotic data, and master practical troubleshooting strategies.

A Deep-Dive into How Distributed Data Systems Work
by Alex Petrov

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it’s often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you’ll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You’ll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed.

This book examines:

  • Storage engines: Explore storage classification and taxonomy, and dive into B-Tree-based and immutable Log Structured storage engines, with differences and use-cases for each
  • Storage building blocks: Learn how database files are organized to build efficient storage, using auxiliary data structures such as Page Cache, Buffer Pool and Write-Ahead Log
  • Distributed systems: Learn step-by-step how nodes and processes connect and build complex communication patterns
  • Database clusters: Which consistency models are commonly used by modern databases and how distributed storage systems achieve consistency

Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh
by James Serra

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each.

James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You'll learn what data lakehouses can help you achieve, as well as how to distinguish data mesh hype from reality. Best of all, you'll be able to determine the most appropriate data architecture for your needs. With this book, you'll:

  • Gain a working understanding of several data architectures
  • Learn the strengths and weaknesses of each approach
  • Distinguish data architecture theory from reality
  • Pick the best architecture for your use case
  • Understand the differences between data warehouses and data lakes
  • Learn common data architecture concepts to help you build better solutions
  • Explore the historical evolution and characteristics of data architectures
  • Learn essentials of running an architecture design session, team organization, and project success factors

Free from product discussions, this book will serve as a timeless resource for years to come.

by Stephan Raaijmakers

Explore the most challenging issues of natural language processing, and learn how to solve them with cutting-edge deep learning!

Inside Deep Learning for Natural Language Processing you’ll find a wealth of NLP insights, including:

  • An overview of NLP and deep learning
  • One-hot text representations
  • Word embeddings
  • Models for textual similarity
  • Sequential NLP
  • Semantic role labeling
  • Deep memory-based NLP
  • Linguistic structure
  • Hyperparameters for deep NLP

Deep learning has advanced natural language processing to exciting new levels and powerful new applications! For the first time, computer systems can achieve "human" levels of summarizing, making connections, and other tasks that require comprehension and context.

Deep Learning for Natural Language Processing reveals the groundbreaking techniques that make these innovations possible. Stephan Raaijmakers distills his extensive knowledge into useful best practices, real-world applications, and the inner workings of top NLP algorithms.

Business Intelligence for Microsoft Power BI, SQL Server Analysis Services, and Excel
by Alberto Ferrari and Marco Russo

Now expanded and updated with modern best practices, this is the most complete guide to Microsoft's DAX language for business intelligence, data modeling, and analytics. Expert Microsoft BI consultants Marco Russo and Alberto Ferrari help you master everything from table functions through advanced code and model optimization. You'll learn exactly what happens under the hood when you run a DAX expression, and use this knowledge to write fast, robust code. This edition focuses on examples you can build and run with the free Power BI Desktop, and helps you make the most of the powerful syntax of variables (VAR) in Power BI, Excel, or Analysis Services. Want to leverage all of DAX's remarkable capabilities? This no-compromise "deep dive" is exactly what you need.

Perform powerful data analysis with DAX for Power BI, SQL Server, and Excel

  • Master core DAX concepts, including calculated columns, measures, and calculation groups
  • Work efficiently with basic and advanced table functions
  • Understand evaluation contexts and the CALCULATE and CALCULATETABLE functions
  • Perform time-based calculations
  • Use calculation groups and calculation items
  • Use syntax of variables (VAR) to write more readable, maintainable code
  • Express diverse and unusual relationships with DAX, including many-to-many relationships and bidirectional filters
  • Master advanced optimization techniques, and improve performance in aggregations
  • Optimize data models to achieve better compression
  • Measure DAX query performance with DAX Studio and learn how to optimize your DAX

Modern Data Lakehouse Architectures with Delta Lake
by Bennie Haelen and Dan Davis

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.

You'll learn how to:

  • Use modern data management and data engineering techniques
  • Understand how ACID transactions bring reliability to data lakes at scale
  • Run streaming and batch jobs against your data lake concurrently
  • Execute update, delete, and merge commands against your data lake
  • Use time travel to roll back and examine previous data versions
  • Build a streaming data quality pipeline following the medallion architecture
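
As a small illustration of the time travel feature mentioned above, the following sketch uses the delta-spark Python package with an illustrative local path; configuration details vary by environment, and the snippet is not taken from the book.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write two versions of a small Delta table, then read the first one back.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo_delta")
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/demo_delta")

v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
print(v0.count())  # 5: the table as of version 0
```
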
by Nicolas Vandeput

Lead your demand planning process to excellence and deliver real value to your supply chain.

In Demand Forecasting Best Practices you’ll learn how to:

  • Lead your team to improve quality while reducing workload
  • Properly define the objectives and granularity of your demand planning
  • Use intelligent KPIs to track accuracy and bias
  • Identify areas for process improvement
  • Help planners and stakeholders add value
  • Determine relevant data to collect and how best to collect it
  • Utilize different statistical and machine learning models

An expert demand forecaster can help an organization avoid overproduction, reduce waste, and optimize inventory levels for a real competitive advantage. Demand Forecasting Best Practices teaches you how to become that virtuoso demand forecaster.

This one-of-a-kind guide reveals forecasting tools, metrics, models, and stakeholder management techniques for delivering more effective supply chains. Everything you learn has been proven and tested in a live business environment. Discover author Nicolas Vandeput’s original five-step framework for demand planning excellence and learn how to tailor it to your own company’s needs. Illustrations and real-world examples make each concept easy to understand and easy to follow. You’ll soon be delivering accurate predictions that are driving major business value.

by Danil Zburivsky and Lynda Partner

Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as yet unimagined data services.

Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you'll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You'll also explore setting up processes to manage cloud-based data, keep it secure, and using advanced analytic and BI tools to analyze it.

Patterns and Paradigms for Scalable, Reliable Services
by Brendan Burns

Without established design patterns to guide them, developers have had to build distributed systems from scratch, and most of these systems end up as unique, one-off designs. Today, the increasing use of containers has paved the way for core distributed system patterns and reusable containerized components. This practical guide presents a collection of repeatable, generic patterns to help make the development of reliable distributed systems far more approachable and efficient.

Author Brendan Burns—Director of Engineering at Microsoft Azure—demonstrates how you can adapt existing software design patterns for designing and building reliable distributed applications. Systems engineers and application developers will learn how these long-established patterns provide a common language and framework for dramatically increasing the quality of your system.

  • Understand how patterns and reusable components enable the rapid development of reliable distributed systems
  • Use the side-car, adapter, and ambassador patterns to split your application into a group of containers on a single machine
  • Explore loosely coupled multi-node distributed patterns for replication, scaling, and communication between the components
  • Learn distributed system patterns for large-scale batch data processing covering work-queues, event-based processing, and coordinated workflows
Use Python to Tackle Your Toughest Business Challenges
by Bradford Tuckfield

Dive into the exciting world of data science with this practical introduction. Packed with essential skills and useful examples, Dive Into Data Science will show you how to obtain, analyze, and visualize data so you can leverage its power to solve common business challenges.

With only a basic understanding of Python and high school math, you’ll be able to effortlessly work through the book and start implementing data science in your day-to-day work. From improving a bike sharing company to extracting data from websites and creating recommendation systems, you’ll discover how to find and use data-driven solutions to make business decisions.

Topics covered include conducting exploratory data analysis, running A/B tests, performing binary classification using logistic regression models, and using machine learning algorithms.

You’ll also learn how to:

  • Forecast consumer demand
  • Optimize marketing campaigns
  • Reduce customer attrition
  • Predict website traffic
  • Build recommendation systems

With this practical guide at your fingertips, harness the power of programming, mathematical theory, and good old common sense to find data-driven solutions that make a difference. Don’t wait; dive right in!
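As a small taste of the binary-classification topic mentioned above, here is a hedged scikit-learn sketch on synthetic data; the features, labels, and effect sizes are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Tiny synthetic example: predict churn (1) vs. stay (0) from two features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))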

Use Programming to Explore Algebra, Statistics, Calculus, and More!
by Amit Saha

Doing Math with Python shows you how to use Python to delve into high school–level math topics like statistics, geometry, probability, and calculus. You’ll start with simple projects, like a factoring program and a quadratic-equation solver, and then create more complex projects once you’ve gotten the hang of things.

Along the way, you’ll discover new ways to explore math and gain valuable programming skills that you’ll use throughout your study of math and computer science. Learn how to:

  • Describe your data with statistics, and visualize it with line graphs, bar charts, and scatter plots
  • Explore set theory and probability with programs for coin flips, dicing, and other games of chance
  • Solve algebra problems using Python’s symbolic math functions
  • Draw geometric shapes and explore fractals like the Barnsley fern, the Sierpinski triangle, and the Mandelbrot set
  • Write programs to find derivatives and integrate functions

Creative coding challenges and applied examples help you see how you can put your new math and coding skills into practice. You’ll write an inequality solver, plot gravity’s effect on how far a bullet will travel, shuffle a deck of cards, estimate the area of a circle by throwing 100,000 “darts” at a board, explore the relationship between the Fibonacci sequence and the golden ratio, and more.

Whether you’re interested in math but have yet to dip into programming or you’re a teacher looking to bring programming into the classroom, you’ll find that Python makes programming easy and practical. Let Python handle the grunt work while you focus on the math.

Uses Python 3
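For a sense of the symbolic math the book builds toward, a minimal SymPy sketch; the expressions are arbitrary examples, not the book's exercises:

    from sympy import symbols, factor, solve, diff, integrate

    x = symbols("x")

    print(factor(x**2 + 5*x + 6))    # (x + 2)*(x + 3)
    print(solve(x**2 + 5*x + 6, x))  # roots of the quadratic: [-3, -2]
    print(diff(x**3, x))             # derivative: 3*x**2
    print(integrate(3*x**2, x))      # antiderivative: x**3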

Take Control of Your Data with Fundamental Linear Algebra, Probability, and Statistics
by Thomas Nield

Master the math needed to excel in data science, machine learning, and statistics. In this book author Thomas Nield guides you through areas like calculus, probability, linear algebra, and statistics and how they apply to techniques like linear regression, logistic regression, and neural networks. Along the way you'll also gain practical insights into the state of data science and how to use those insights to maximize your career.

Learn how to:

  • Use Python code and libraries like SymPy, NumPy, and scikit-learn to explore essential mathematical concepts like calculus, linear algebra, statistics, and machine learning
  • Understand techniques like linear regression, logistic regression, and neural networks in plain English, with minimal mathematical notation and jargon
  • Perform descriptive statistics and hypothesis testing on a dataset to interpret p-values and statistical significance
  • Manipulate vectors and matrices and perform matrix decomposition
  • Integrate and build upon incremental knowledge of calculus, probability, statistics, and linear algebra, and apply it to regression models including neural networks
  • Navigate practically through a data science career and avoid common pitfalls, assumptions, and biases while tuning your skill set to stand out in the job market
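As one small illustration of the hypothesis-testing material listed above, a sketch using NumPy and SciPy on synthetic data; the group names and effect size are invented:

    import numpy as np
    from scipy import stats

    # Hypothetical A/B-style comparison: do two samples share the same mean?
    rng = np.random.default_rng(42)
    control = rng.normal(loc=10.0, scale=2.0, size=50)
    variant = rng.normal(loc=11.0, scale=2.0, size=50)

    t_stat, p_value = stats.ttest_ind(control, variant)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests a real difference
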
The science and strategy of customer retention
by Carl S. Gold

The beating heart of any product or service business is returning clients. Don't let your hard-won customers vanish, taking their money with them. In Fighting Churn with Data you'll learn powerful data-driven techniques to maximize customer retention and minimize actions that cause them to stop engaging or unsubscribe altogether. This hands-on guide is packed with techniques for converting raw data into measurable metrics, testing hypotheses, and presenting findings that are easily understandable to non-technical decision makers.

Plan and Build Robust Data Systems
by Joe Reis and Matt Housley

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Get a concise overview of the entire data engineering landscape
  • Assess data engineering problems using an end-to-end framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle
Implement Trustworthy End-to-End Data Solutions
by Andy Petrella

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that enables data teams to gain greater visibility of data and its usage. If you're a data engineer, data architect, or machine learning engineer who depends on the quality of your data, this book shows you how to focus on the practical aspects of introducing data observability in your everyday work.

Author Andy Petrella helps you build the right habits to identify and solve data issues, such as data drifts and poor quality, so you can stop their propagation in data applications, pipelines, and analytics. You'll learn ways to introduce data observability, including setting up a framework for generating and collecting all the information you need.

  • Learn the core principles and benefits of data observability
  • Use data observability to detect, troubleshoot, and prevent data issues
  • Follow the book's recipes to implement observability in your data projects
  • Use data observability to create a trustworthy communication framework with data consumers
  • Learn how to educate your peers about the benefits of data observability
A Primer on Making Informative and Compelling Figures
by Claus O. Wilke

Effective visualization is the best way to communicate information from the increasingly large and complex datasets in the natural and social sciences. But with the increasing power of visualization software today, scientists, engineers, and business analysts often have to navigate a bewildering array of visualization choices and options.

This practical book takes you through many commonly encountered visualization problems, and it provides guidelines on how to turn large datasets into clear and compelling figures. What visualization type is best for the story you want to tell? How do you make informative figures that are visually pleasing? Author Claus O. Wilke teaches you the elements most critical to successful data visualization.

  • Explore the basic concepts of color as a tool to highlight, distinguish, or represent a value
  • Understand the importance of redundant coding to ensure you provide key information in multiple ways
  • Use the book’s visualizations directory, a graphical guide to commonly used types of data visualizations
  • Get extensive examples of good and bad figures
  • Learn how to use figures in a document or report and how to employ them effectively to tell a compelling story
by Mark L. Gillenson

In the newly revised third edition of Fundamentals of Database Management Systems, veteran database expert Dr. Mark Gillenson delivers an authoritative and comprehensive account of contemporary database management. The Third Edition assists readers in understanding critical topics in the subject, including data modeling, relational database concepts, logical and physical database design, SQL, data administration, data security, NoSQL, blockchain, database in the cloud, and more.

The author offers a firm grounding in the fundamentals of databases while, at the same time, providing a wide-ranging survey of database subfields relevant to information systems professionals. The supplements now also include the author's audio narration of the accompanying PowerPoint slides. Readers will also find:

  • Brand-new content on NoSQL database management, NewSQL, blockchain, and database-intensive applications, including data analytics, ERP, CRM, and SCM
  • Updated and revised narrative material designed to offer a friendly introduction to database management
  • Renewed coverage of cloud-based database management
  • Extensive updates to incorporate the transition from rotating disk secondary storage to solid state drives
by Chris Garrard

Geoprocessing with Python teaches you how to use the Python programming language, along with free and open source tools, to read, write, and process geospatial data.

by Ekaterina Kochmar

Hit the ground running with this in-depth introduction to the NLP skills and techniques that allow your computers to speak human.

In Getting Started with Natural Language Processing you’ll learn about:

  • Fundamental concepts and algorithms of NLP
  • Useful Python libraries for NLP
  • Building a search algorithm
  • Extracting information from raw text
  • Predicting sentiment of an input text
  • Author profiling
  • Topic labeling
  • Named entity recognition

Getting Started with Natural Language Processing is an enjoyable and understandable guide that helps you engineer your first NLP algorithms. Your tutor is Dr. Ekaterina Kochmar, lecturer at the University of Bath, who has helped thousands of students take their first steps with NLP. Full of Python code and hands-on projects, each chapter provides a concrete example with practical techniques that you can put into practice right away. If you’re a beginner to NLP and want to upgrade your applications with functions and features like information extraction, user profiling, and automatic topic labeling, this is the book for you.

Understanding data with graphs
by Philipp K. Janert

Gnuplot in Action, Second Edition is a major revision of this popular and authoritative guide for developers, engineers, and scientists who want to learn and use gnuplot effectively. Fully updated for gnuplot version 5, the book includes four pages of color illustrations and four bonus appendixes available in the eBook.

With examples in Neo4j
by Tomaž Bratanič

Practical methods for analyzing your data with graphs, revealing hidden connections and new insights.

Graphs are the natural way to represent and understand connected data. This book explores the most important algorithms and techniques for graphs in data science, with concrete advice on implementation and deployment. You don’t need any graph experience to start benefiting from this insightful guide. These powerful graph algorithms are explained in clear, jargon-free text and illustrations that make them easy to apply to your own projects.

In Graph Algorithms for Data Science you will learn:

  • Labeled-property graph modeling
  • Constructing a graph from structured data such as CSV or SQL
  • NLP techniques to construct a graph from unstructured data
  • Cypher query language syntax to manipulate data and extract insights
  • Social network analysis algorithms like PageRank and community detection
  • How to translate graph structure to a ML model input with node embedding models
  • Using graph features in node classification and link prediction workflows

Graph Algorithms for Data Science is a hands-on guide to working with graph-based data in applications like machine learning, fraud detection, and business data analysis. It’s filled with fascinating and fun projects, demonstrating the ins-and-outs of graphs. You’ll gain practical skills by analyzing Twitter, building graphs with NLP techniques, and much more.
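The book works primarily in Cypher against Neo4j; as a rough sketch of what that looks like from Python, here is a minimal query through the official neo4j driver (the connection details, labels, and relationship types are hypothetical):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (u:User)-[:FOLLOWS]->(v:User)
    RETURN v.name AS name, count(u) AS followers
    ORDER BY followers DESC
    LIMIT 5
    """

    with driver.session() as session:
        for record in session.run(query):  # simple in-degree as a popularity proxy
            print(record["name"], record["followers"])

    driver.close()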

Examples in Gremlin
by Dave Bechberger and Josh Perryman

Relationships in data often look far more like a web than an orderly set of rows and columns. Graph databases shine when it comes to revealing valuable insights within complex, interconnected data such as demographics, financial records, or computer networks.

In Graph Databases in Action, experts Dave Bechberger and Josh Perryman illuminate the design and implementation of graph databases in real-world applications. You'll learn how to choose the right database solutions for your tasks, and how to use your new knowledge to build agile, flexible, and high-performing graph-powered applications!

by Chuck Lam

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

by Alex Holmes

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.

Storage and Analysis at Internet Scale
by Tom White

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

  • Learn fundamental components such as MapReduce, HDFS, and YARN
  • Explore MapReduce in depth, including steps for developing applications with it
  • Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
  • Learn two data formats: Avro for data serialization and Parquet for nested data
  • Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
  • Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
  • Learn the HBase distributed database and the ZooKeeper distributed configuration service
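Hadoop's native MapReduce API is Java, but the Hadoop Streaming interface accepts any program that reads stdin and writes stdout; a word-count mapper and reducer sketched in Python for flavor (the file names and job-submission details are omitted and purely illustrative):

    # --- mapper.py: emit "word<TAB>1" for every word on stdin ---
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # --- reducer.py: sum counts per word (input arrives sorted by key) ---
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
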
Learn how to effectively prepare data for successful data analytics
by Roy Jafari

Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who's developed college-level courses on data preprocessing and related subjects.

With this book, you'll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.

You'll learn about different technical and analytical aspects of data preprocessing (data collection, data cleaning, data integration, data reduction, and data transformation) and get to grips with implementing them using the open source Python programming environment.

The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you'll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data.

By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.

What you will learn

  • Use Python to perform analytics functions on your data
  • Understand the role of databases and how to effectively pull data from databases
  • Perform data preprocessing steps defined by your analytics goals
  • Recognize and resolve data integration challenges
  • Identify the need for data reduction and execute it
  • Detect opportunities to improve analytics with data transformation

Who this book is for

This book is for junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data. You don't need any prior experience with data preprocessing to get started with this book. However, basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are a prerequisite.
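A small pandas sketch of the cleaning and transformation steps described above; the file name and columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("survey_raw.csv")  # hypothetical raw data

    # Missing values: fill numeric gaps with the median, drop rows missing the key field
    df["income"] = df["income"].fillna(df["income"].median())
    df = df.dropna(subset=["respondent_id"])

    # Outliers: clip income to the 1st-99th percentile range
    low, high = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(low, high)

    # Transformation: standardize a numeric column as z-scores
    df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()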

by Nicholas Dimiduk and Amandeep Khurana

HBase in Action has all the knowledge you need to design, build, and run applications using HBase. First, it introduces you to the fundamentals of distributed systems and large scale data handling. Then, you'll explore real-world applications and code samples with just enough theory to understand the practical techniques. You'll see how to build applications with HBase and take advantage of the MapReduce processing framework. And along the way you'll learn patterns and best practices.

An Introduction to Designing with D3
by Scott Murray

Create and publish your own interactive data visualization projects on the web, even if you have little or no experience with data visualization or web development. It's inspiring and fun with this friendly, accessible, and practical hands-on introduction. This fully updated and expanded second edition takes you through the fundamental concepts and methods of D3, the most powerful JavaScript library for expressing data visually in a web browser.

Ideal for designers with no coding experience, reporters exploring data journalism, and anyone who wants to visualize and share data, this step-by-step guide will also help you expand your web programming skills by teaching you the basics of HTML, CSS, JavaScript, and SVG.

  • Learn D3 with downloadable code and over 140 examples
  • Create bar charts, scatter plots, pie charts, stacked bar charts, and force-directed graphs
  • Use smooth, animated transitions to show changes in your data
  • Introduce interactivity to help users explore your data
  • Create custom geographic maps with panning, zooming, labels, and tooltips
  • Walk through the creation of a complete visualization project, from start to finish
  • Explore inspiring case studies with nine accomplished designers talking about their D3-based projects
Big data, machine learning, and more, using Python tools
by Davy Cielen, Arno D. B. Meysman and Mohamed Ali

Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science.

Automating SQL server tasks with PowerShell commands
by Chrissy LeMaire, Rob Sewell, Jess Pomfret and Cláudio Silva

If you work with SQL Server, dbatools is a lifesaver. This book will show you how to use this free and open source PowerShell module to automate just about every SQL server task you can imagine—all in just one month!

In Learn dbatools in a Month of Lunches you will learn how to:

  • Perform instance-to-instance and customized migrations
  • Automate security audits, tempdb configuration, alerting, and reporting
  • Schedule and monitor PowerShell tasks in SQL Server Agent
  • Bulk-import any type of data into SQL Server
  • Install dbatools in secure environments

Written by a group of expert authors including dbatools creator Chrissy LeMaire, Learn dbatools in a Month of Lunches teaches you techniques that will make you more effective—and efficient—than you ever thought possible. In twenty-eight lunchbreak lessons, you’ll learn the most important use cases of dbatools and the favorite functions of its core developers. Stabilize and standardize your SQL server environment, and simplify your tasks by building automation, alerting, and reporting with this powerful tool.

Use, manage, and build secure and scalable databases with PostgreSQL 16
by Luca Ferrari and Enrico Pirozzi

The latest edition of this PostgreSQL book will help you to start using PostgreSQL from absolute scratch, helping you to quickly understand the internal workings of the database. With a structured approach and practical examples, go on a journey that covers the basics, from SQL statements and how to run server-side programs, to configuring, managing, securing, and optimizing database performance.

This new edition will not only help you get to grips with all the recent changes within the PostgreSQL ecosystem but will also dig deeper into concepts like partitioning and replication with a fresh set of examples. The book also provides Docker images for each chapter, which make the learning experience faster and easier. Starting with the absolute basics of databases, the book sails through to advanced concepts like window functions, logging, auditing, extending the database, configuration, partitioning, and replication. It will also help you seamlessly migrate your existing database system to PostgreSQL and contains a dedicated chapter on disaster recovery. Each chapter ends with practice questions to test your learning at regular intervals.

By the end of this book, you will be able to install, configure, manage, and develop applications against a PostgreSQL database.

What you will learn

  • Gain a deeper understanding of PostgreSQL internals like transactions, MVCC, security and replication
  • Enhance data management with PostgreSQL’s latest partitioning features
  • Choose the right replication strategy for your database
  • See concrete examples of how to migrate data from another database, perform backups and restores, monitor your PostgreSQL installation and more
  • Ensure security and compliance with schemas and user privileges
  • Create customized database functions and extensions
  • Get to grips with server-side programming, window functions, and triggers

Who this book is for

Learning PostgreSQL 16 is for anyone interested in learning about the PostgreSQL database from scratch. Anyone looking to build robust data warehousing applications and scale the database for high-availability and performance using the latest features of PostgreSQL will also find this book useful. Although prior knowledge of PostgreSQL is not required, familiarity with databases is expected.

A Deceptively Simple Introduction to the Terrifyingly Beautiful World of Computers and Data Science
by Zed A. Shaw

Zed Shaw has created the world's most reliable system for learning Python. Follow it and you will succeed--just like the millions of beginners Zed has taught to date! You bring the discipline, persistence, and attention; the author supplies the masterful knowledge you need to succeed.

In Learn Python the Hard Way, Fifth Edition, you'll learn Python by working through 60 lovingly crafted exercises. Read them. Type in the code. Run it. Fix your mistakes. Repeat. As you do, you'll learn how a computer works, how to solve problems, and how to enjoy programming . . . even when it's driving you crazy.

  • Install a complete Python environment
  • Organize and write code
  • Fix and break code
  • Basic mathematics
  • Strings and text
  • Interact with users
  • Work with files
  • Looping and logic
  • Object-oriented programming
  • Data structures using lists and dictionaries
  • Modules, classes, and objects
  • Python packaging
  • Automated testing
  • Basic SQL for Data Science
  • Web scraping
  • Fixing bad data (munging)
  • The "Data" part of "Data Science"

It'll be frustrating at first. But if you keep trying, you'll get it--and it'll feel amazing! This course will reward you for every minute you put into it. Soon, you'll know one of the world's most powerful, popular programming languages. You'll be a Python programmer.

This Book Is Perfect For

  • Total beginners with zero programming experience
  • Junior developers who know one or two languages
  • Returning professionals who haven't written code in years
  • Aspiring Data Scientists or academics who need to learn to code
  • Seasoned professionals looking for a fast, simple crash course in Python for Data Science

Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.

by Don Jones

Learn SQL Server Administration in a Month of Lunches is the perfect way to get started with SQL Server operations, including maintenance, backup and recovery, high availability, and performance monitoring. In about an hour a day over a month, you'll learn exactly what you can do, and what you shouldn't touch. Most importantly, you'll learn the day-to-day tasks and techniques you need to keep SQL Server humming along smoothly.

Improving Productivity for Business Processes and Workflows
by Paul Papanek Stork

Processing information efficiently is critical to the successful operation of modern organizations. One particularly helpful tool is Microsoft Power Automate, a low-code/no-code development platform designed to help tech-savvy users create and implement workflows. This practical book explains how small-business and enterprise users can replace manual work that takes days with an automated process you can set up in a few hours using Power Automate.

Paul Papanek Stork, principal architect at Don't Pa..Panic Consulting, provides a concise yet comprehensive overview of the foundational skills required to understand and work with Power Automate. You'll learn how to use these workflows, or flows, to automate repetitive tasks or complete business processes without manual intervention.

Whether you're transferring form responses to a list, managing document approvals, sending automatic reminders for overdue tasks, or archiving emails and attachments, these skills will help you:

  • Design and build flows with templates or from scratch
  • Select triggers and actions to automate a process
  • Add actions to a flow to retrieve and process information
  • Use functions to transform information
  • Control the logic of a process using conditional actions, loops, or parallel branches
  • Implement error checking to avoid potential problems
Transforming Data into Insights
by Jeremey Arnold

Microsoft Power BI is a data analytics and visualization tool powerful enough for the most demanding data scientists, but accessible enough for everyday use for anyone who needs to get more from data. The market has many books designed to train and equip professional data analysts to use Power BI, but few of them make this tool accessible to anyone who wants to get up to speed on their own.

This streamlined intro to Power BI covers all the foundational aspects and features you need to go from "zero to hero" with data and visualizations. Whether you work with large, complex datasets or work in Microsoft Excel, author Jeremey Arnold shows you how to teach yourself Power BI and use it confidently as a regular data analysis and reporting tool.

You'll learn how to:

  • Import, manipulate, visualize, and investigate data in Power BI
  • Approach solutions for both self-service and enterprise BI
  • Use Power BI in your organization's business intelligence strategy
  • Produce effective reports and dashboards
  • Create environments for sharing reports and managing data access with your team
  • Determine the right solution for using Power BI offerings based on size, security, and computational needs
Lightning-Fast Data Analytics
by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

  • Learn Python, SQL, Scala, or Java high-level Structured APIs
  • Understand Spark operations and SQL Engine
  • Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow
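A minimal PySpark DataFrame and SQL sketch in the spirit of the bullets above; the JSON source and column names are invented for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

    events = spark.read.json("s3://example-bucket/events/*.json")  # hypothetical source

    daily = (events
             .withColumn("day", F.to_date("timestamp"))
             .groupBy("day", "event_type")
             .agg(F.count("*").alias("events"))
             .orderBy("day"))
    daily.show()

    # The same aggregation through the SQL engine
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, count(*) AS events FROM events GROUP BY event_type").show()
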
Finding Stories in Internet Data
by Lam Thuy Vo

Did fake Twitter accounts help sway a presidential election? What can Facebook and Reddit archives tell us about human behavior? In Mining Social Media, senior BuzzFeed reporter Lam Thuy Vo shows you how to use Python and key data analysis tools to find the stories buried in social media.

Whether you’re a professional journalist, an academic researcher, or a citizen investigator, you’ll learn how to use technical tools to collect and analyze data from social media sources to build compelling, data-driven stories.

Learn how to:

  • Write Python scripts and use APIs to gather data from the social web
  • Download data archives and dig through them for insights
  • Inspect HTML downloaded from websites for useful content
  • Format, aggregate, sort, and filter your collected data using Google Sheets
  • Create data visualizations to illustrate your discoveries
  • Perform advanced data analysis using Python, Jupyter Notebooks, and the pandas library
  • Apply what you’ve learned to research topics on your own

Social media is filled with thousands of hidden stories just waiting to be told. Learn to use the data-sleuthing tools that professionals use to write your own data-driven stories.
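To hint at what the archive-digging step can look like, a small pandas sketch; the export file and its columns are hypothetical:

    import pandas as pd

    # Hypothetical newline-delimited JSON export of posts
    posts = pd.read_json("posts_archive.jsonl", lines=True)
    posts["created"] = pd.to_datetime(posts["created_utc"], unit="s")

    # Who posts most, and when is activity highest?
    print(posts["author"].value_counts().head(10))
    print(posts.set_index("created").resample("W").size())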

An Introduction for Scientists and Engineers
by Allen B. Downey

Modeling and Simulation in Python is a thorough but easy-to-follow introduction to physical modeling—that is, the art of describing and simulating real-world systems.

Readers are guided through modeling things like world population growth, infectious disease, bungee jumping, baseball flight trajectories, celestial mechanics, and more while simultaneously developing a strong understanding of fundamental programming concepts like loops, vectors, and functions.

Clear and concise, with a focus on learning by doing, the author spares the reader abstract, theoretical complexities and gets right to hands-on examples that show how to produce useful models and simulations.
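In the spirit of the book's population models, a minimal loop-based sketch of logistic growth; the starting values are illustrative, not the book's:

    # Euler-style update of a simple logistic population model
    population = 8.2          # billions, hypothetical starting value
    growth_rate = 0.01        # per year
    carrying_capacity = 11.0  # billions

    trajectory = []
    for year in range(2025, 2101):
        trajectory.append((year, population))
        population += growth_rate * population * (1 - population / carrying_capacity)

    print(trajectory[-1])     # projected (year, population) at the end of the run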

Explore industry-ready time series forecasting using modern machine learning and deep learning
by Manu Joseph

We live in a serendipitous era in which the explosion in the quantity of data collected and a renewed interest in data-driven techniques such as machine learning (ML) have changed the landscape of analytics and, with it, time series forecasting. This book, filled with industry-tested tips and tricks, takes you beyond commonly used classical statistical methods such as ARIMA and introduces you to the latest techniques from the world of ML.

This is a comprehensive guide to analyzing, visualizing, and creating state-of-the-art forecasting systems, complete with common topics such as ML and deep learning (DL) as well as rarely touched-upon topics such as global forecasting models, cross-validation strategies, and forecast metrics. You’ll begin by exploring the basics of data handling, data visualization, and classical statistical methods before moving on to ML and DL models for time series forecasting. This book takes you on a hands-on journey in which you’ll develop state-of-the-art ML (linear regression to gradient-boosted trees) and DL (feed-forward neural networks, LSTMs, and transformers) models on a real-world dataset along with exploring practical topics such as interpretability.

By the end of this book, you’ll be able to build world-class time series forecasting systems and tackle problems in the real world.

What you will learn

  • Find out how to manipulate and visualize time series data like a pro
  • Set strong baselines with popular models such as ARIMA
  • Discover how time series forecasting can be cast as regression
  • Engineer features for machine learning models for forecasting
  • Explore the exciting world of ensembling and stacking models
  • Get to grips with the global forecasting paradigm
  • Understand and apply state-of-the-art DL models such as N-BEATS and Autoformer
  • Explore multi-step forecasting and cross-validation strategies

Who this book is for

The book is for data scientists, data analysts, machine learning engineers, and Python developers who want to build industry-ready time series models. Since the book explains most concepts from the ground up, basic proficiency in Python is all you need. Prior understanding of machine learning or forecasting will help speed up your learning. For experienced machine learning and forecasting practitioners, this book has a lot to offer in terms of advanced techniques and traversing the latest research frontiers in time series forecasting.
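One of the book's central ideas, casting forecasting as regression over lag features, can be sketched with pandas and scikit-learn on synthetic data; the series and model settings here are purely illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # Synthetic monthly series with trend plus noise
    rng = np.random.default_rng(0)
    y = pd.Series(100 + 0.5 * np.arange(120) + rng.normal(scale=2, size=120))

    # Forecasting as regression: predict y[t] from its last 12 lags
    frame = pd.DataFrame({f"lag_{k}": y.shift(k) for k in range(1, 13)})
    frame["target"] = y
    frame = frame.dropna()

    train, test = frame.iloc[:-12], frame.iloc[-12:]
    model = GradientBoostingRegressor().fit(train.drop(columns="target"), train["target"])
    preds = model.predict(test.drop(columns="target"))
    print("MAE:", np.mean(np.abs(preds - test["target"].values)))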

Covers MongoDB version 3.0
by Kyle Banker, Peter Bakkum, Shaun Verch, Douglas Garrett and Tim Hawkins

MongoDB in Action, Second Edition is a completely revised and updated version. It introduces MongoDB 3.0 and the document-oriented database model. This perfectly paced book gives you both the big picture you'll need as a developer and enough low-level detail to satisfy system engineers.

Powerful and Scalable Data Storage
by Shannon Bradshaw, Eoin Brazil and Kristina Chodorow

Manage your data with a system designed to support modern application development. Updated for MongoDB 4.2, the third edition of this authoritative and accessible guide shows you the advantages of using document-oriented databases. You’ll learn how this secure, high-performance system enables flexible data models, high availability, and horizontal scalability.

Authors Shannon Bradshaw, Eoin Brazil, and Kristina Chodorow provide guidance for database developers, advanced configuration for system administrators, and use cases for a variety of projects. NoSQL newcomers and experienced MongoDB users will find updates on querying, indexing, aggregation, transactions, replica sets, ops management, sharding and data administration, durability, monitoring, and security.

In six parts, this book shows you how to:

  • Work with MongoDB, perform write operations, find documents, and create complex queries
  • Index collections, aggregate data, and use transactions for your application
  • Configure a local replica set and learn how replication interacts with your application
  • Set up cluster components and choose a shard key for a variety of applications
  • Explore aspects of application administration and configure authentication and authorization
  • Use stats when monitoring, back up and restore deployments, and use system settings when deploying MongoDB
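For readers new to the document model, a minimal PyMongo sketch of writes, queries, and an aggregation; the connection string, database, and fields are hypothetical:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
    orders = client.shop.orders

    orders.insert_one({"customer": "ada", "total": 42.50, "items": ["book", "pen"]})

    for doc in orders.find({"total": {"$gt": 20}}):
        print(doc["customer"], doc["total"])

    # Aggregate total spend per customer
    pipeline = [{"$group": {"_id": "$customer", "spend": {"$sum": "$total"}}}]
    for row in orders.aggregate(pipeline):
        print(row)
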
Learn to build apps that can understand people
by George-Bogdan Ivanov

Natural Language Processing (NLP) is a collection of techniques to analyze, interpret, and create human-understandable text and speech. Advances in machine learning have pushed NLP to new levels of accuracy and uncanny realism.

Natural Language Processing for Hackers lays out everything you need to crawl, clean, build, fine-tune, and deploy natural language models from scratch—all with easy-to-read Python code.

A Practical Introduction
by Yuli Vasiliev

Natural Language Processing with Python and spaCy will show you how to create NLP applications like chatbots, text-condensing scripts, and order-processing tools quickly and easily. You’ll learn how to leverage the spaCy library to extract meaning from text intelligently; how to determine the relationships between words in a sentence (syntactic dependency parsing); identify nouns, verbs, and other parts of speech (part-of-speech tagging); and sort proper nouns into categories like people, organizations, and locations (named entity recognition). You’ll even learn how to transform statements into questions to keep a conversation going.

You’ll also learn how to:

  • Work with word vectors to mathematically find words with similar meanings (Chapter 5)
  • Identify patterns within data using spaCy's built-in displaCy visualizer (Chapter 7)
  • Automatically extract keywords from user input and store them in a relational database (Chapter 9)
  • Deploy a chatbot app to interact with users over the internet (Chapter 11)

“Try This” sections in each chapter encourage you to practice what you’ve learned by expanding the book’s example scripts to handle a wider range of inputs, add error handling, and build professional-quality applications.

By the end of the book, you’ll be creating your own NLP applications with Python and spaCy.
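A minimal spaCy sketch of the parsing and entity-recognition features described above; it assumes the small English model (en_core_web_sm) is installed, and the sentence is arbitrary:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ada Lovelace wrote the first program in London.")

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)  # POS tags and dependency parse

    for ent in doc.ents:
        print(ent.text, ent.label_)  # named entities, e.g. PERSON, GPE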

Building Language Applications with Hugging Face
by Lewis Tunstall, Leandro von Werra and Thomas Wolf

Since their introduction in 2017, transformers have quickly become the dominant architecture for achieving state-of-the-art results on a variety of natural language processing tasks. If you're a data scientist or coder, this practical book, now revised in full color, shows you how to train and scale these large models using Hugging Face Transformers, a Python-based deep learning library.

Transformers have been used to write realistic news stories, improve Google Search queries, and even create chatbots that tell corny jokes. In this guide, authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf, among the creators of Hugging Face Transformers, use a hands-on approach to teach you how transformers work and how to integrate them in your applications. You'll quickly learn a variety of tasks they can help you solve.

  • Build, debug, and optimize transformer models for core NLP tasks, such as text classification, named entity recognition, and question answering
  • Learn how transformers can be used for cross-lingual transfer learning
  • Apply transformers in real-world scenarios where labeled data is scarce
  • Make transformer models efficient for deployment using techniques such as distillation, pruning, and quantization
  • Train transformers from scratch and learn how to scale to multiple GPUs and distributed environments
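To show how little code a first experiment takes, a hedged sketch with the Hugging Face pipeline API; it downloads default pretrained checkpoints on first use, and the example texts are invented:

    from transformers import pipeline

    classifier = pipeline("text-classification")
    print(classifier("This book made transformers finally click for me."))

    qa = pipeline("question-answering")
    print(qa(question="What library does the book teach?",
             context="The book teaches the Hugging Face Transformers library."))
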
by Dan Sullivan

The Google Cloud Certified Professional Data Engineer Study Guide provides everything you need to prepare for this important exam and master the skills necessary to land that coveted Google Cloud Professional Data Engineer certification. It begins with a pre-book assessment quiz to evaluate what you already know; each chapter then features exam objectives and review questions, and the online learning environment includes additional complete practice tests.

Written by Dan Sullivan, a popular and experienced online course author for machine learning, big data, and Cloud topics, Google Cloud Certified Professional Data Engineer Study Guide is your ace in the hole for deploying and managing analytics and machine learning applications.

  • Build and operationalize storage systems, pipelines, and compute infrastructure
  • Understand machine learning models and learn how to select pre-built models
  • Monitor and troubleshoot machine learning models
  • Design analytics and machine learning applications that are secure, scalable, and highly available.

This exam guide is designed to help you develop an in-depth understanding of data engineering and machine learning on Google Cloud Platform.

by Daniel Y. Chen

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple data sets.

Pandas for Everyone, 2nd Edition, brings together practical knowledge and insight for solving real problems with Pandas, even if you're new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world data science problems such as using regularization to prevent data overfitting, or when to use unsupervised machine learning methods to find the underlying structure in a data set.

New features to the second edition include:

  • Extended coverage of plotting and the seaborn data visualization library
  • Expanded examples and resources
  • Updated Python 3.9 code and packages coverage, including statsmodels and scikit-learn libraries
  • Online bonus material on geopandas, Dask, and creating interactive graphics with Altair

Chen gives you a jumpstart on using Pandas with a realistic data set and covers combining data sets, handling missing data, and structuring data sets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability and introduces you to the wider Python data analysis ecosystem.

  • Work with DataFrames and Series, and import or export data
  • Create plots with matplotlib, seaborn, and pandas
  • Combine data sets and handle missing data
  • Reshape, tidy, and clean data sets so they're easier to work with
  • Convert data types and manipulate text strings
  • Apply functions to scale data manipulations
  • Aggregate, transform, and filter large data sets with groupby
  • Leverage Pandas' advanced date and time capabilities
  • Fit linear models using statsmodels and scikit-learn libraries
  • Use generalized linear modeling to fit models with different response variables
  • Compare multiple models to select the best one
  • Regularize to overcome overfitting and improve performance
  • Use clustering in unsupervised machine learning
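A tiny pandas sketch of the reshaping and groupby work covered above; the table and column names are invented:

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["north", "north", "south", "south"],
        "q1": [100, 150, 90, 120],
        "q2": [110, 160, 95, 130],
    })

    # Reshape wide quarterly columns into tidy rows, then aggregate with groupby
    tidy = sales.melt(id_vars="region", var_name="quarter", value_name="revenue")
    print(tidy.groupby(["region", "quarter"])["revenue"].sum())
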
by Boris Paskhaver

Take the next steps in your data science career! This friendly and hands-on guide shows you how to start mastering Pandas with skills you already know from spreadsheet software.

In Pandas in Action you will learn how to:

  • Import datasets, identify issues with their data structures, and optimize them for efficiency
  • Sort, filter, pivot, and draw conclusions from a dataset and its subsets
  • Identify trends from text-based and time-based data
  • Organize, group, merge, and join separate datasets
  • Use a GroupBy object to store multiple DataFrames

Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You’ll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.

by Leo S. Hsu and Regina O. Obe

In PostGIS in Action, Third Edition you will learn:

  • An introduction to spatial databases
  • Geometry, geography, raster, and topology spatial types, functions, and queries
  • Applying PostGIS to real-world problems
  • Extending PostGIS to web and desktop applications
  • Querying data from external sources using PostgreSQL Foreign Data Wrappers
  • Optimizing queries for maximum speed
  • Simplifying geometries for greater efficiency

PostGIS in Action, Third Edition teaches readers of all levels to write spatial queries for PostgreSQL. You’ll start by exploring vector-, raster-, and topology-based GIS before quickly progressing to analyzing, viewing, and mapping data. This fully updated third edition covers key changes in PostGIS 3.1 and PostgreSQL 13, including parallelization support, partitioned tables, and new JSON functions that help in creating web mapping applications.

Solve real-world Database Administration challenges with 180+ practical recipes and best practices
by Gianni Ciolli, Boriss Mejías, Jimmy Angelakos, Vibhor Kumar and Simon Riggs

PostgreSQL has seen a huge increase in its customer base in the past few years and is becoming one of the go-to solutions for anyone who has a database-specific challenge. This PostgreSQL book touches on all the fundamentals of Database Administration in a problem-solution format. It is intended to be the perfect desk reference guide.

This new edition focuses on recipes based on the new PostgreSQL 16 release. The additions include handling complex batch loading scenarios with the SQL MERGE statement, security improvements, running Postgres on Kubernetes or with TPA and Ansible, and more. This edition also focuses on certain performance gains, such as query optimization, and the acceleration of specific operations, such as sort. It will help you understand roles, ensuring high availability, concurrency, and replication. It also draws your attention to aspects like validating backups, recovery, monitoring, and scaling aspects. This book will act as a one-stop solution to all your real-world database administration challenges.

By the end of this book, you will be able to manage, monitor, and replicate your PostgreSQL 16 database for efficient administration and maintenance with the best practices from experts.

What you will learn

  • Discover how to improve batch data loading with the SQL MERGE statement
  • Use logical replication to apply large transactions in parallel
  • Improve your backup and recovery performance with server-side compression
  • Tackle basic to high-end and real-world PostgreSQL challenges with practical recipes
  • Monitor and fine-tune your database with ease
  • Learn to navigate the newly introduced features of PostgreSQL 16
  • Efficiently secure your PostgreSQL database with new and updated features

Who this book is for

This Postgres book is for database administrators, data architects, database developers, and anyone with an interest in planning and running live production databases using PostgreSQL 16. Those looking for hands-on solutions to any problem associated with PostgreSQL 16 administration will also find this book useful. Some experience with handling PostgreSQL databases will help you to make the most out of this book; however, it is a useful resource even if you are just beginning your Postgres journey.

by Nina Zumel and John Mount

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.

From Core Concepts to Applications Using Python
by Mike X Cohen

If you want to work in any computational or technical field, you need to understand linear algebra. As the study of matrices and operations acting upon them, linear algebra is the mathematical basis of nearly all algorithms and analyses implemented in computers. But the way it's presented in decades-old textbooks is much different from how professionals use linear algebra today to solve real-world modern applications.

This practical guide from Mike X Cohen teaches the core concepts of linear algebra as implemented in Python, including how they're used in data science, machine learning, deep learning, computational simulations, and biomedical data processing applications. Armed with knowledge from this book, you'll be able to understand, implement, and adapt myriad modern analysis methods and algorithms.

Ideal for practitioners and students using computer technology and algorithms, this book introduces you to:

  • The interpretations and applications of vectors and matrices
  • Matrix arithmetic (various multiplications and transformations)
  • Independence, rank, and inverses
  • Important decompositions used in applied linear algebra (including LU and QR)
  • Eigendecomposition and singular value decomposition
  • Applications including least-squares model fitting and principal components analysis
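Two of the applications listed above, least-squares fitting and the singular value decomposition, sketched with NumPy on synthetic data (the coefficients and sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # Least-squares model fit: minimize ||Xb - y||
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("fitted coefficients:", coef)

    # Singular value decomposition, the workhorse behind PCA and pseudoinverses
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    print("singular values:", S)
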
A Comprehensive Guide to Building Real-World NLP Systems
by Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta and Harshit Surana

Many books and courses tackle natural language processing (NLP) problems with toy use cases and well-defined datasets. But if you want to build, iterate, and scale NLP systems in a business setting and tailor them for particular industry verticals, this is your guide. Software engineers and data scientists will learn how to navigate the maze of options available at each step of the journey.

Through the course of the book, authors Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana will guide you through the process of building real-world NLP solutions embedded in larger product setups. You’ll learn how to adapt your solutions for different industry verticals such as healthcare, social media, and retail.

With this book, you’ll:

  • Understand the wide spectrum of problem statements, tasks, and solution approaches within NLP
  • Implement and evaluate different NLP applications using machine learning and deep learning methods
  • Fine-tune your NLP solution based on your business problem and industry vertical
  • Evaluate various algorithms and approaches for NLP product tasks, datasets, and stages
  • Produce software solutions following best practices around release, deployment, and DevOps for NLP systems
  • Understand best practices, opportunities, and the roadmap for NLP from a business and product leader’s perspective
by Avi Pfeffer

Practical Probabilistic Programming introduces the working programmer to probabilistic programming. In it, you'll learn how to use the PP paradigm to model application domains and then express those probabilistic models in code. Although PP can seem abstract, in this book you'll immediately work on practical examples, like using the Figaro language to build a spam filter and applying Bayesian and Markov networks to diagnose computer system data problems and recover digital images.

by Kim Falk

Online recommender systems help users find movies, jobs, restaurants—even romance! There’s an art in combining statistics, demographics, and query terms to achieve results that will delight them. Learn to build a recommender system the right way: it can make or break your application!

50+ Essential Concepts Using R and Python
by Peter Bruce, Andrew Bruce and Peter Gedeck

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

  • Why exploratory data analysis is a key preliminary step in data science
  • How random sampling can reduce bias and yield a higher-quality dataset, even with big data
  • How the principles of experimental design yield definitive answers to questions
  • How to use regression to estimate outcomes and detect anomalies
  • Key classification techniques for predicting which categories a record belongs to
  • Statistical machine learning methods that "learn" from data
  • Unsupervised learning methods for extracting meaning from unlabeled data
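As a taste of the resampling ideas above, a small NumPy bootstrap sketch; the data are synthetic and the statistic (the median) is chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(7)
    sample = rng.exponential(scale=3.0, size=200)  # hypothetical skewed data

    # Bootstrap: resample with replacement to estimate uncertainty in the median
    medians = [np.median(rng.choice(sample, size=sample.size, replace=True))
               for _ in range(2000)]
    low, high = np.percentile(medians, [2.5, 97.5])
    print(f"median = {np.median(sample):.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
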
Prediction with Statistics & Machine Learning
by Aileen Nielsen

Time series data analysis is increasingly important due to the massive production of such data through the internet of things, the digitalization of healthcare, and the rise of smart cities. As continuous monitoring and data collection become more common, the need for competent time series analysis with both statistical and machine learning techniques will increase.

Covering innovations in time series data analysis and use cases from the real world, this practical guide will help you solve the most common data engineering and analysis challenges in time series, using both traditional statistical and modern machine learning techniques. Author Aileen Nielsen offers an accessible, well-rounded introduction to time series in both R and Python that will have data scientists, software engineers, and researchers up and running quickly.

You’ll get the guidance you need to confidently:

  • Find and wrangle time series data
  • Undertake exploratory time series data analysis
  • Store temporal data
  • Simulate time series data
  • Generate and select features for a time series
  • Measure error
  • Forecast and classify time series with machine or deep learning
  • Evaluate accuracy and performance
by Michael Baron

Probability and Statistics for Computer Scientists, Third Edition helps students understand fundamental concepts of Probability and Statistics, general methods of stochastic modeling, simulation, queuing, and statistical data analysis; make optimal decisions under uncertainty; model and evaluate computer systems; and prepare for advanced probability-based courses. Written in a lively style with simple language and now including R as well as MATLAB, this classroom-tested book can be used for one- or two-semester courses.

Features:

  • Axiomatic introduction of probability
  • Expanded coverage of statistical inference and data analysis, including estimation and testing, Bayesian approach, multivariate regression, chi-square tests for independence and goodness of fit, nonparametric statistics, and bootstrap
  • Numerous motivating examples and exercises including computer projects
  • Fully annotated R codes in parallel to MATLAB
  • Applications in computer science, software engineering, telecommunications, and related areas
  • In-depth yet accessible treatment of computer science-related topics

Starting with the fundamentals of probability, the text takes students through topics heavily featured in modern computer science, computer engineering, software engineering, and associated fields, such as computer simulations, Monte Carlo methods, stochastic processes, Markov chains, queuing theory, statistical inference, and regression. It also meets the requirements of the Accreditation Board for Engineering and Technology (ABET).
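
As a taste of the Monte Carlo methods the text covers, here is a minimal sketch; the book's code is in R and MATLAB, so this Python version is only an illustrative stand-in, not an example from the book:

```python
import random

# Monte Carlo estimate of pi: draw random points in the unit square and
# count how many fall inside the quarter circle of radius 1.
def estimate_pi(n_samples: int = 1_000_000) -> float:
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n_samples

print(estimate_pi())  # approaches 3.14159... as the sample size grows
```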

Start Writing Code to Wrangle, Analyze, and Visualize Data with R
by Joel Ross and Michael Freeman

Using data science techniques, you can transform raw data into actionable insights for domains ranging from urban planning to precision medicine. Programming Skills for Data Science brings together all the foundational skills you need to get started, even if you have no programming or data science experience.

Leading instructors Michael Freeman and Joel Ross guide you through installing and configuring the tools you need to solve professional-level data science problems, including the widely used R language and Git version-control system. They explain how to wrangle your data into a form where it can be easily used, analyzed, and visualized so others can see the patterns you've uncovered. Step by step, you'll master powerful R programming techniques and troubleshooting skills for probing data in new ways, and at larger scales.

Freeman and Ross teach through practical examples and exercises that can be combined into complete data science projects. Everything's focused on real-world application, so you can quickly start analyzing your own data and getting answers you can act upon. Learn to

  • Install your complete data science environment, including R and RStudio
  • Manage projects efficiently, from version tracking to documentation
  • Host, manage, and collaborate on data science projects with GitHub
  • Master R language fundamentals: syntax, programming concepts, and data structures
  • Load, format, explore, and restructure data for successful analysis
  • Interact with databases and web APIs
  • Master key principles for visualizing data accurately and intuitively
  • Produce engaging, interactive visualizations with ggplot and other R packages
  • Transform analyses into sharable documents and sites with R Markdown
  • Create interactive web data science applications with Shiny
  • Collaborate smoothly as part of a data science team

Essential Tools for Working with Data
by Jake VanderPlas

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all—IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you'll learn how:

  • IPython and Jupyter provide computational environments for scientists using Python
  • NumPy includes the ndarray for efficient storage and manipulation of dense data arrays
  • Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data
  • Matplotlib includes capabilities for a flexible range of data visualizations
  • Scikit-learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms
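
To illustrate the Scikit-Learn point above, here is a minimal fit/predict sketch; it is not drawn from the book and simply assumes scikit-learn is installed:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The estimator API is uniform across algorithms: construct, fit, then predict or score.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out split
```
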
Data Wrangling with pandas, NumPy & Jupyter
by Wes McKinney

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process.

Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

  • Use the Jupyter notebook and IPython shell for exploratory computing
  • Learn basic and advanced features in NumPy
  • Get started with data analysis tools in the pandas library
  • Use flexible tools to load, clean, transform, merge, and reshape data
  • Create informative visualizations with matplotlib
  • Apply the pandas groupby facility to slice, dice, and summarize datasets (sketched after this list)
  • Analyze and manipulate regular and irregular time series data
  • Learn how to solve real-world data analysis problems with thorough, detailed examples
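
As a taste of the groupby facility mentioned above, here is a minimal pandas sketch; the toy sales data is made up and the example is not taken from the book:

```python
import pandas as pd

# Toy sales data (made up for illustration).
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units":  [10, 15, 7, 3, 12],
    "price":  [2.5, 2.5, 3.0, 3.0, 2.75],
})

# Split-apply-combine: group rows by region, then summarize each group.
summary = df.groupby("region").agg(
    total_units=("units", "sum"),
    avg_price=("price", "mean"),
)
print(summary)
```
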
A Hands-On Introduction
by Yuli Vasiliev

You will discover Python’s rich set of built-in data structures for basic operations, as well as its robust ecosystem of open-source libraries for data science, including NumPy, pandas, scikit-learn, matplotlib, and more. Examples show how to load data in various formats, how to streamline, group, and aggregate data sets, and how to create charts, maps, and other visualizations. Later chapters go in-depth with demonstrations of real-world data applications, including using location data to power a taxi service, market basket analysis to identify items commonly purchased together, and machine learning to predict stock prices.

An Introduction to Using Anaconda, JupyterLab, and Python's Scientific Libraries
by Lee Vaughan

Python Tools for Scientists will introduce you to Python tools you can use in your scientific research, including Anaconda, Spyder, Jupyter Notebooks, JupyterLab, and numerous Python libraries. You’ll learn to use Python for tasks such as creating visualizations, representing geospatial information, simulating natural events, and manipulating numerical data.

Once you’ve built an optimal programming environment with Anaconda, you’ll learn how to organize your projects and use interpreters, text editors, notebooks, and development environments to work with your code. Following the book’s fast-paced Python primer, you’ll tour a range of scientific tools and libraries like scikit-learn and seaborn that you can use to manipulate and visualize your data, or analyze it with machine learning algorithms.

You’ll also learn how to:

  • Create isolated projects in virtual environments, build interactive notebooks, test code in the Qt console, and use Spyder’s interactive development features
  • Use Python’s built-in data types, write custom functions and classes, and document your code
  • Represent data with the essential NumPy, Matplotlib, and pandas libraries
  • Use Python plotting libraries like Plotly, HoloViews, and Datashader to handle large datasets and create 3D visualizations

Regardless of your scientific field, Python Tools for Scientists will show you how to choose the best tools to meet your research and computational analysis needs.

Import, Tidy, Transform, Visualize and Model Data
by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund

Use R to turn data into insight, knowledge, and understanding. With this practical book, aspiring data scientists will learn how to do data science with R and RStudio, along with the tidyverse—a collection of R packages designed to work together to make data science fast, fluent, and fun. Even if you have no programming experience, this updated edition will have you doing data science quickly.

You'll learn how to import, transform, and visualize your data and communicate the results. And you'll get a complete, big-picture understanding of the data science cycle and the basic tools you need to manage the details. Updated for the latest tidyverse features and best practices, new chapters show you how to get data from spreadsheets, databases, and websites. Exercises help you practice what you've learned along the way.

You'll understand how to:

  • Visualize: Create plots for data exploration and communication of results
  • Transform: Discover variable types and the tools to work with them
  • Import: Get data into R and in a form convenient for analysis
  • Program: Learn R tools for solving data problems with greater clarity and ease
  • Communicate: Integrate prose, code, and results with Quarto
Practical applications with deep learning
by Masato Hagiwara

In Real-world Natural Language Processing you will learn how to:

  • Design, develop, and deploy useful NLP applications
  • Create named entity taggers
  • Build machine translation systems
  • Construct language generation systems and chatbots
  • Use advanced NLP concepts such as attention and transfer learning

Real-world Natural Language Processing teaches you how to create practical NLP applications without getting bogged down in complex language theory and the mathematics of deep learning. In this engaging book, you’ll explore the core tools and techniques required to build a huge range of powerful NLP apps, including chatbots, language detectors, and text classifiers.

by Jan L. Harrington

Relational Database Design and Implementation: Clearly Explained, Fourth Edition, provides the conceptual and practical information necessary to develop a database design and management scheme that ensures data accuracy and user satisfaction while optimizing performance.

Database systems underlie the large majority of business information systems. Most of those in use today are based on the relational data model, a way of representing data and data relationships using only two-dimensional tables. This book covers relational database theory as well as providing a solid introduction to SQL, the international standard for the relational database data manipulation language.

The book begins by reviewing basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL. Topics such as the relational data model, normalization, data entities, and Codd's Rules (and why they are important) are covered clearly and concisely. In addition, the book looks at the impact of big data on relational databases and the option of using NoSQL databases for that purpose.

  • Features updated and expanded coverage of SQL and new material on big data, cloud computing, and object-relational databases
  • Presents design approaches that ensure data accuracy and consistency and help boost performance
  • Includes three case studies, each illustrating a different database design challenge
  • Reviews the basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL