/dev/reading
Category: Data Science

125 books, 6 subcategories
by Dzejla Medjedovic, Emin Tahirovic and Ines Dedovic

Massive modern datasets make traditional data structures and algorithms grind to a halt. This fun and practical guide introduces cutting-edge techniques that can reliably handle even the largest distributed datasets.

In Algorithms and Data Structures for Massive Datasets you will learn:

  • Probabilistic sketching data structures for practical problems
  • Choosing the right database engine for your application
  • Evaluating and designing efficient on-disk data structures and algorithms
  • Understanding the algorithmic trade-offs involved in massive-scale systems
  • Deriving basic statistics from streaming data
  • Correctly sampling streaming data
  • Computing percentiles with limited space resources

Algorithms and Data Structures for Massive Datasets reveals a toolbox of new methods that are perfect for handling modern big data applications. You’ll explore the novel data structures and algorithms that underpin Google, Facebook, and other enterprise applications that work with truly massive amounts of data. These effective techniques can be applied to any discipline, from finance to text analysis. Graphics, illustrations, and hands-on industry examples make complex ideas practical to implement in your projects, and there are no mathematical proofs to puzzle over. Work through this one-of-a-kind guide, and you’ll find the sweet spot of saving space without sacrificing your data’s accuracy.
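
To give a flavor of one topic mentioned above, here is a minimal reservoir-sampling sketch in Python: it keeps a uniform random sample of a stream without ever storing the whole stream. It illustrates the general technique only and is not code from the book.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item         # replace with probability k / (i + 1)
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```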

by Douglas G. McIlwraith, Haralambos Marmanis and Dmitry Babenko

Algorithms of the Intelligent Web, Second Edition teaches the most important approaches to algorithmic web data analysis, enabling you to create your own machine learning applications that crunch, munge, and wrangle data collected from users, web applications, sensors and website logs.

Building Meaningful Data Models at Scale
by Rui Pedro Machado and Helder Russa

With the shift from data warehouses to data lakes, data now lands in repositories before it's been transformed, enabling engineers to model raw data into clean, well-defined datasets. dbt (data build tool) helps you take data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL.

Authors Rui Machado from Monstarlab and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering aspects. If you know your business well and have the technical skills to model raw data into clean, well-defined datasets, you'll learn how to design and deliver data models without any technical influence.

With this book, you'll learn:

  • What dbt is and how a dbt project is structured
  • How dbt fits into the data engineering and analytics worlds
  • How to collaborate on building data models
  • The main tools and architectures for building useful, functional data models
  • How to fit dbt into data warehousing and data lake architectures
  • How to build tests for data transformations

Enable Analytics and AI-Driven Innovation in the Cloud
by Marco Tranquillin, Valliappa Lakshmanan and Firat Tekiner

All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence in a fast and efficient way. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle from ingestion to activation in a cloud environment using real-world enterprise architectures. You'll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you'll be able to leverage recent AI/ML patterns to get accurate and quicker insights to drive competitive advantage.

You'll learn how to:

  • Design a modern and secure cloud native or hybrid data analytics and machine learning platform
  • Accelerate data-led innovation by consolidating enterprise data in a governed, scalable, and resilient data platform
  • Democratize access to enterprise data and govern how business teams extract insights and build AI/ML capabilities
  • Enable your business to make decisions in real time using streaming pipelines
  • Build an MLOps platform to move to a predictive and prescriptive analytics approach

A Tour of Statistical Software Design
by Norman Matloff

R is the world's most popular language for developing statistical software: Archaeologists use it to track the spread of ancient civilizations, drug companies use it to discover which medications are safe and effective, and actuaries use it to assess financial risks and keep economies running smoothly.

The Art of R Programming takes you on a guided tour of software development with R, from basic types and data structures to advanced topics like closures, recursion, and anonymous functions. No statistical knowledge is required, and your programming skills can range from hobbyist to pro.

Along the way, you'll learn about functional and object-oriented programming, running mathematical simulations, and rearranging complex data into simpler, more useful formats. You'll also learn to:

  • Create artful graphs to visualize complex data sets and functions
  • Write more efficient code using parallel R and vectorization
  • Interface R with C/C++ and Python for increased speed or functionality
  • Find new R packages for text analysis, image manipulation, and more
  • Squash annoying bugs with advanced debugging techniques

Whether you're designing aircraft, forecasting the weather, or you just need to tame your data, The Art of R Programming is your guide to harnessing the power of statistical computing.

A practical guide to probabilistic modelling
by Osvaldo Martin

The third edition of Bayesian Analysis with Python serves as an introduction to the main concepts of applied Bayesian modeling using PyMC, a state-of-the-art probabilistic programming library, and other libraries that support and facilitate modeling like ArviZ, for exploratory analysis of Bayesian models; Bambi, for flexible and easy hierarchical linear modeling; PreliZ, for prior elicitation; PyMC-BART, for flexible non-parametric regression; and Kulprit, for variable selection.

In this updated edition, a brief and conceptual introduction to probability theory enhances your learning journey by introducing new topics like Bayesian additive regression trees (BART), featuring updated examples. Refined explanations, informed by feedback and experience from previous editions, underscore the book's emphasis on Bayesian statistics. You will explore various models, including hierarchical models, generalized linear models for regression and classification, mixture models, Gaussian processes, and BART, using synthetic and real datasets.

By the end of this book, you will possess a functional understanding of probabilistic modeling, enabling you to design and implement Bayesian models for your data science challenges. You'll be well-prepared to delve into more advanced material or specialized statistical modeling if the need arises.

What you will learn

  • Build probabilistic models using PyMC and Bambi
  • Analyze and interpret probabilistic models with ArviZ
  • Acquire the skills to sanity-check models and modify them if necessary
  • Build better models with prior and posterior predictive checks
  • Learn the advantages and caveats of hierarchical models
  • Compare models and choose between alternative ones
  • Interpret results and apply your knowledge to real-world problems
  • Explore common models from a unified probabilistic perspective
  • Apply the Bayesian framework's flexibility for probabilistic thinking
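
As a small taste of the workflow the book teaches, the sketch below fits a beta-Bernoulli model with PyMC and summarizes it with ArviZ. The data are made up, and the book's own examples are more elaborate.

```python
import arviz as az
import numpy as np
import pymc as pm

# Hypothetical data: 100 coin flips, 62 of them heads.
data = np.repeat([1, 0], [62, 38])

with pm.Model():
    theta = pm.Beta("theta", alpha=1, beta=1)            # prior on the success rate
    pm.Bernoulli("y", p=theta, observed=data)            # likelihood
    idata = pm.sample(1000, tune=1000, random_seed=42)   # draw posterior samples

print(az.summary(idata, var_names=["theta"]))
```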

Who this book is for

If you are a student, data scientist, researcher, or developer looking to get started with Bayesian data analysis and probabilistic programming, this book is for you. The book is introductory, so no previous statistical knowledge is required, although some experience in using Python and scientific libraries like NumPy is expected.

Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks
by Will Kurt

Get the most from your data, and have fun doing it

Probability and statistics are increasingly important in a huge range of professions. But many people use data in ways they don’t even understand, meaning they aren’t getting the most from it. Bayesian Statistics the Fun Way will change that.

This book will give you a complete understanding of Bayesian statistics through simple explanations and un-boring examples. Find out the probability of UFOs landing in your garden, how likely Han Solo is to survive a flight through an asteroid belt, how to win an argument about conspiracy theories, and whether a burglary really was a burglary, to name a few examples.

By using these off-the-beaten-track examples, the author actually makes learning statistics fun. And you’ll learn real skills, like how to:

  • Measure your own level of uncertainty in a conclusion or belief
  • Calculate Bayes’ theorem and understand what it’s useful for
  • Find the posterior, likelihood, and prior to check the accuracy of your conclusions
  • Calculate distributions to see the range of your data
  • Compare hypotheses and draw reliable conclusions from them

Next time you find yourself with a sheaf of survey results and no idea what to do with them, turn to Bayesian Statistics the Fun Way to get the most value from your data.
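
For readers who want to see the core calculation in code, this small Python snippet applies Bayes’ theorem to made-up numbers; it illustrates the idea the book teaches rather than reproducing one of its examples.

```python
# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
# All numbers below are illustrative, not taken from the book.
p_h = 0.01               # prior: how plausible the hypothesis is before seeing data
p_d_given_h = 0.95       # likelihood of the data if the hypothesis is true
p_d_given_not_h = 0.10   # likelihood of the data otherwise

p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)   # total probability of the data
p_h_given_d = p_d_given_h * p_h / p_d                   # posterior

print(f"Posterior P(H|D) = {p_h_given_d:.3f}")          # about 0.088
```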

A beginner's guide to R and RStudio
by Dr. Jonathan Carroll

Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques like loops and conditionals to create your own custom functions. You’ll come away with a toolkit of strategies for analyzing and visualizing data of all sorts using R and RStudio.

Principles and best practices of scalable realtime data systems
by Nathan Marz and James Warren

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

by Emily Robinson and Jacqueline Nolis

You are going to need more than technical knowledge to succeed as a data scientist.

Build a Career in Data Science teaches you what school leaves out, from how to land your first job to the lifecycle of a data science project, and even how to become a manager.

Patterns for Designing & Building Event-Driven Architectures
by Adam Bellemare

The exponential growth of data combined with the need to derive real-time business value is a critical issue today. An event-driven data mesh can power real-time operational and analytical workloads, all from a single set of data product streams. With practical real-world examples, this book shows you how to successfully design and build an event-driven data mesh.

Building an Event-Driven Data Mesh provides:

  • Practical tips for iteratively building your own event-driven data mesh, including hurdles you'll experience, possible solutions, and how to obtain real value as soon as possible
  • Solutions to pitfalls you may encounter when moving your organization from monoliths to event-driven architectures
  • A clear understanding of how events relate to systems and other events in the same stream and across streams
  • A realistic look at event modeling options, such as fact, delta, and command type events, including how these choices will impact your data products
  • Best practices for handling events at scale, privacy, and regulatory compliance
  • Advice on asynchronous communication and handling eventual consistency

Create and deploy enterprise-ready ETL pipelines by employing modern methods
by Brij Kishore Pandey and Emily Ro Schoof

Modern extract, transform, and load (ETL) pipelines for data engineering have favored the Python language for its broad range of uses and a large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing.

In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment to create pipelines. Once you've explored the ETL pipeline design principles and ETL development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process, which involves extracting valuable data; performing transformations, through cleaning, manipulation, and ensuring data integrity; and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons when building data pipelines and leveraging cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about the concept of test-driven development for ETL pipelines to ensure safe deployments.

By the end of this book, you’ll have worked on several hands-on examples to create high-performance ETL pipelines to develop robust, scalable, and resilient environments using Python.

What you will learn

  • Explore the available libraries and tools to create ETL pipelines using Python
  • Write clean and resilient ETL code in Python that can be extended and easily scaled
  • Understand the best practices and design principles for creating ETL pipelines
  • Orchestrate the ETL process and scale the ETL pipeline effectively
  • Discover tools and services available in AWS for ETL pipelines
  • Understand different testing strategies and implement them with the ETL process
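
As a minimal sketch of the extract-transform-load steps described above, the following example uses pandas and SQLite with hypothetical file and column names; the book works through fuller pipelines and cloud tooling.

```python
import sqlite3

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["customer_id"])               # drop rows missing a key field
    df["order_date"] = pd.to_datetime(df["order_date"])  # normalize types
    df["amount"] = df["amount"].clip(lower=0)            # simple integrity rule
    return df

def load(df: pd.DataFrame, db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

# "orders.csv" and its columns are hypothetical.
load(transform(extract("orders.csv")), "warehouse.db")
```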

Who this book is for

If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.

A Practitioner's Guide
by Jesus Barrasa and Jim Webber

Incredibly useful, knowledge graphs help organizations keep track of medical research, cybersecurity threat intelligence, GDPR compliance, web user engagement, and much more. They do so by storing interlinked descriptions of entities (objects, events, situations, or abstract concepts) and encoding the underlying information. How do you create a knowledge graph? And how do you move it from theory into production?

Using hands-on examples, this practical book shows data scientists and data engineers how to build their own knowledge graphs. Authors Jesus Barrasa and Jim Webber from Neo4j illustrate common patterns for building knowledge graphs that solve many of today's pressing knowledge management problems. You'll quickly discover how these graphs become increasingly useful as you add data and augment them with algorithms and machine learning.

  • Learn the organizing principles necessary to build a knowledge graph
  • Explore how graph databases serve as a foundation for knowledge graphs
  • Understand how to import structured and unstructured data into your graph
  • Follow examples to build integration-and-search knowledge graphs
  • Learn what pattern detection knowledge graphs help you accomplish
  • Explore dependency knowledge graphs through examples
  • Use examples of natural language knowledge graphs and chatbots
  • Use graph algorithms and ML to gain insight into connected data

Applying Causal Inference in the Tech Industry
by Matheus Facure

How many buyers will an additional dollar of online marketing bring in? Which customers will only buy when given a discount coupon? How do you establish an optimal pricing strategy? The best way to determine how the levers at our disposal affect the business metrics we want to drive is through causal inference.

In this book, author Matheus Facure, senior data scientist at Nubank, explains the largely untapped potential of causal inference for estimating impacts and effects. Managers, data scientists, and business analysts will learn classical causal inference methods like randomized control trials (A/B tests), linear regression, propensity score, synthetic controls, and difference-in-differences. Each method is accompanied by an application in the industry to serve as a grounding example.

With this book, you will:

  • Learn how to use basic concepts of causal inference
  • Frame a business problem as a causal inference problem
  • Understand how bias gets in the way of causal inference
  • Learn how causal effects can differ from person to person
  • Use repeated observations of the same customers across time to adjust for biases
  • Understand how causal effects differ across geographic locations
  • Examine noncompliance bias and effect dilution

by Satnam Alag

There's a great deal of wisdom in a crowd, but how do you listen to a thousand people talking at once? Identifying the wants, needs, and knowledge of internet users can be like listening to a mob.

In the Web 2.0 era, leveraging the collective power of user contributions, interactions, and feedback is the key to market dominance. A new category of powerful programming techniques lets you discover the patterns, inter-relationships, and individual profiles—the collective intelligence—locked in the data people leave behind as they surf websites, post blogs, and interact with other users.

Collective Intelligence in Action is a hands-on guidebook for implementing collective-intelligence concepts using Java. It is the first Java-based book to emphasize the underlying algorithms and technical implementation of vital data gathering and mining techniques like analyzing trends, discovering relationships, and making predictions. It provides a pragmatic approach to personalization by combining content-based analysis with collaborative approaches.

Land your dream job with the help of resume-building tips, over 100 mock questions, and a unique portfolio
by Kedeisha Bryan and Taamir Ransome

Preparing for a data engineering interview can often get overwhelming due to the abundance of tools and technologies, leaving you struggling to prioritize which ones to focus on. This hands-on guide provides you with the essential foundational and advanced knowledge needed to simplify your learning journey.

The book begins by helping you gain a clear understanding of the nature of data engineering and how it differs from organization to organization. As you progress through the chapters, you’ll receive expert advice, practical tips, and real-world insights on everything from creating a resume and cover letter to networking and negotiating your salary. The chapters also offer refresher training on data engineering essentials, including data modeling, database architecture, ETL processes, data warehousing, cloud computing, big data, and machine learning. As you advance, you’ll gain a holistic view by exploring continuous integration/continuous delivery (CI/CD), data security, and privacy. Finally, the book will help you practice with case studies, mock interviews, and behavioral questions.

By the end of this book, you will have a clear understanding of what is required to succeed in an interview for a data engineering role.

What you will learn

  • Create maintainable and scalable code for unit testing
  • Understand the fundamental concepts of core data engineering tasks
  • Prepare with over 100 behavioral and technical interview questions
  • Discover data engineer archetypes and how they can help you prepare for the interview
  • Apply the essential concepts of Python and SQL in data engineering
  • Build your personal brand to noticeably stand out as a candidate

Who this book is for

If you’re an aspiring data engineer looking for guidance on how to land, prepare for, and excel in data engineering interviews, this book is for you. Familiarity with the fundamentals of data engineering, such as data modeling, cloud warehouses, programming (Python and SQL), building data pipelines, scheduling your workflows (Airflow), and APIs, is a prerequisite.

by Jonathan Rioux

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:

  • Manage your data as it scales across multiple machines
  • Scale up your data programs with full confidence
  • Read and write data to and from a variety of sources and formats
  • Deal with messy data with PySpark’s data manipulation functionality
  • Discover new data sets and perform exploratory data analysis
  • Build automated data pipelines that transform, summarize, and get insights from data
  • Troubleshoot common PySpark errors
  • Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.
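
For a sense of the data-manipulation style the book covers, here is a small PySpark aggregation; the file and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Hypothetical CSV file with "region" and "amount" columns.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

summary = (
    df.where(F.col("amount") > 0)                        # drop refunds and bad rows
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)
summary.show()
```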

by Vlad Riscutia

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure.

In Data Engineering on Azure you will learn how to:

  • Pick the right Azure services for different data scenarios
  • Manage data inventory
  • Implement production quality data modeling, analytics, and machine learning workloads
  • Handle data governance
  • Use DevOps to increase reliability
  • Ingest, store, and distribute data
  • Apply best practices for compliance and access control

Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring an engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning.

by Gareth Eagar

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability.

You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You’ll learn how to ensure strong data governance, and about populating data marts and data warehouses along with how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS.

By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!

What you will learn

  • Seamlessly ingest streaming data with Amazon Kinesis Data Firehose
  • Optimize, denormalize, and join datasets with AWS Glue Studio
  • Use Amazon S3 events to trigger a Lambda process to transform a file
  • Load data into a Redshift data warehouse and run queries with ease
  • Visualize and explore data using Amazon QuickSight
  • Extract sentiment data from a dataset using Amazon Comprehend
  • Build transactional data lakes using Apache Iceberg with Amazon Athena
  • Learn how a data mesh approach can be implemented on AWS
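
One bullet above mentions using S3 events to trigger a Lambda transform. A minimal, hypothetical handler might look like the sketch below; the book's own examples go further.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; reads the new object and logs its size."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    print(json.dumps({"bucket": bucket, "key": key, "bytes": len(body)}))
    # A real transform step would parse `body`, reshape it, and write the result
    # to another bucket, a Glue table, or a warehouse.
```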

Who this book is for

This book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts, while gaining practical experience with common data engineering services on AWS, will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it’s not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.

A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL
by Roberto Zagni

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.

This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work.

By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain idempotent code using dbt materialization

Who this book is for

This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, futureproof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision makers looking to understand the importance of building a data platform and foster a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.

A practical guide to operationalizing scalable data analytics systems on GCP
by Adi Wijaya

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines right from storing and processing data and workflow orchestration to presenting data through visualization dashboards.

Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.

By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What you will learn

  • Load data into BigQuery and materialize its output for downstream consumption
  • Build data pipeline orchestration using Cloud Composer
  • Develop Airflow jobs to orchestrate and automate a data warehouse
  • Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
  • Leverage Pub/Sub for messaging and ingestion for event-driven systems
  • Use Dataflow to perform ETL on streaming data
  • Unlock the power of your data with Data Studio
  • Estimate GCP costs for your end-to-end data solutions
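
As a small illustration of querying BigQuery from Python, the sketch below uses the google-cloud-bigquery client library and a public dataset; it assumes default GCP credentials and is not taken from the book.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# A public dataset; swap in your own table for real work.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```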

Who this book is for

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. Beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing, in general, will help you make the most out of this book.

People, Processes, and Tools to Operationalize Data Trustworthiness
by Evren Eryurek, Uri Gilad, Valliappa Lakshmanan, Anita Kibunguchy-Grant and Jessi Ashdown

As you move data to the cloud, you need to consider a comprehensive approach to data governance, along with well-defined and agreed-upon policies to ensure your organization meets compliance requirements. Data governance incorporates the ways people, processes, and technology work together to ensure data is trustworthy and can be used effectively. This practical guide shows you how to effectively implement and scale data governance throughout your organization.

Chief information, data, and security officers and their teams will learn strategy and tooling to support democratizing data and unlocking its value while enforcing security, privacy, and other governance standards. Through good data governance, you can inspire customer trust, enable your organization to identify business efficiencies, generate more competitive offerings, and improve customer experience. This book shows you how.

You'll learn:

  • Data governance strategies addressing people, processes, and tools
  • Benefits and challenges of a cloud-based data governance approach
  • How data governance is conducted from ingest to preparation and use
  • How to handle the ongoing improvement of data quality
  • Challenges and techniques in governing streaming data
  • Data protection for authentication, security, backup, and monitoring
  • How to build a data culture in your organization

Modern Data Architecture with Data Mesh and Data Fabric
by Piethein Strengholt

As data management continues to evolve rapidly, managing all of your data in a central place, such as a data warehouse, is no longer scalable. Today's world is about quickly turning data into value. This requires a paradigm shift in the way we federate responsibilities, manage data, and make it available to others. With this practical book, you'll learn how to design a next-gen data architecture that takes into account the scale you need for your organization.

Executives, architects and engineers, analytics teams, and compliance and governance staff will learn how to build a next-gen data landscape. Author Piethein Strengholt provides blueprints, principles, observations, best practices, and patterns to get you up to speed.

  • Examine data management trends, including regulatory requirements, privacy concerns, and new developments such as data mesh and data fabric
  • Go deep into building a modern data architecture, including cloud data landing zones, domain-driven design, data product design, and more
  • Explore data governance and data security, master data management, self-service data marketplaces, and the importance of metadata

Delivering Data-Driven Value at Scale
by Zhamak Dehghani

We're at an inflection point in data, where our data management solutions no longer match the complexity of organizations, the proliferation of data sources, and the scope of our aspirations to get value from data with AI and analytics. In this practical book, author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale.

Dehghani guides practitioners, architects, technical leaders, and decision makers on their journey from traditional big data architecture to a distributed and multidimensional approach to analytical data management. Data mesh treats data as a product, considers domains as a primary concern, applies platform thinking to create self-serve data infrastructure, and introduces a federated computational model of data governance.

  • Get a complete introduction to data mesh principles and its constituents
  • Design a data mesh architecture
  • Guide a data mesh strategy and execution
  • Navigate organizational design to a decentralized data ownership model
  • Move beyond traditional data warehouses and lakes to a distributed data mesh

by Jacek Majchrzak, Sven Balnojan, Marian Siwiak and Mariusz Sieraczkiewicz

Revolutionize the way your organization approaches data with a data mesh! This new decentralized architecture outpaces monolithic lakes and warehouses and can work for a company of any size.

In Data Mesh in Action you will learn how to:

  • Implement a data mesh in your organization
  • Turn data into a data product
  • Move from your current data architecture to a data mesh
  • Identify data domains, and decompose an organization into smaller, manageable domains
  • Set up the central governance and local governance levels over data
  • Balance responsibilities between the two levels of governance
  • Establish a platform that allows efficient connection of distributed data products and automated governance

Data Mesh in Action reveals how this groundbreaking architecture looks for both startups and large enterprises. You won’t need any new technology—this book shows you how to start implementing a data mesh with flexible processes and organizational change. You’ll explore both an extended case study and real-world examples. As you go, you’ll be expertly guided through discussions around Socio-Technical Architecture and Domain-Driven Design with the goal of building a sleek data-as-a-product system. Plus, dozens of workshop techniques for both in-person and remote meetings help you onboard colleagues and drive a successful transition.

A practical guide to accelerating Snowflake development using universal data modeling techniques
by Serge Gershkovich

The Snowflake Data Cloud is one of the fastest-growing platforms for data warehousing and application workloads. Snowflake's scalable, cloud-native architecture and expansive set of features and objects enable you to deliver data solutions faster than ever before.

Yet, we must ensure that these solutions are developed using recommended design patterns and accompanied by documentation that’s easily accessible to everyone in the organization.

This book will help you get familiar with simple and practical data modeling frameworks that accelerate agile design and evolve with the project from concept to code. These universal principles have helped guide database design for decades, and this book pairs them with unique Snowflake-native objects and examples like never before – giving you a two-for-one crash course in theory as well as direct application.

By the end of this Snowflake book, you’ll have learned how to leverage Snowflake’s innovative features, such as time travel, zero-copy cloning, and change-data-capture, to create cost-effective, efficient designs through time-tested modeling principles that are easily digestible when coupled with real-world examples.

What you will learn

  • Discover the time-saving benefits and applications of data modeling
  • Learn about Snowflake’s cloud-native architecture and its features
  • Understand and apply modeling techniques using Snowflake objects
  • Master universal modeling concepts and language through Snowflake objects
  • Get comfortable reading and transforming semistructured data
  • Learn directly with pre-built recipes and examples
  • Learn to apply modeling frameworks from Star to Data Vault

Who this book is for

This book is for developers working with SQL who are looking to build a strong foundation in modeling best practices and gain an understanding of where they can be effectively applied to save time and effort. Whether you’re an ace in SQL logic or starting out in database design, this book will equip you with the practical foundations of data modeling to guide you on your data journey with Snowflake. Developers who’ve recently discovered Snowflake will be able to uncover its core features and learn to incorporate them into universal modeling frameworks.

Moving and Processing Data for Analytics
by James Densmore

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack.

You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:

  • What a data pipeline is and how it works
  • How data is moved and processed on modern data infrastructure, including cloud platforms
  • Common tools and products used by data engineers to build pipelines
  • How pipelines support analytics and reporting needs
  • Considerations for pipeline maintenance, testing, and alerting

by Bas P. Harenslak and Julian Rutger de Ruiter

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
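
For readers new to Airflow, a minimal DAG looks roughly like this (Airflow 2.x imports, placeholder task logic); the book builds far more realistic pipelines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _extract():
    print("pull data from a source system")      # placeholder task logic

def _transform():
    print("clean and reshape the extracted data")

with DAG(
    dag_id="example_pipeline",                   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    extract >> transform                         # run extract before transform
```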

A Practitioner's Guide to Building Trustworthy Data Pipelines
by Barr Moses, Lior Gavish and Molly Vorwerck

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to these questions, this book is for you.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

  • Build more trustworthy and reliable data pipelines
  • Write scripts to make data checks and identify broken pipelines with data observability
  • Learn how to set and maintain data SLAs, SLIs, and SLOs
  • Develop and lead data quality initiatives at your company
  • Learn how to treat data services and systems with the diligence of production software
  • Automate data lineage graphs across your data ecosystem
  • Build anomaly detectors for your critical data assets
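
As a tiny, hypothetical example of the kind of checks described above, this pandas helper reports null rate, duplicate keys, and freshness; real data observability tooling goes well beyond this.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame, key: str, timestamp: str, max_lag_hours: int = 24) -> dict:
    """Report simple quality signals: nulls, duplicate keys, and data freshness."""
    now = pd.Timestamp.now(tz="UTC")
    latest = pd.to_datetime(df[timestamp], utc=True).max()
    return {
        "null_rate": float(df[key].isna().mean()),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "hours_since_last_record": (now - latest).total_seconds() / 3600,
        "is_fresh": (now - latest) <= pd.Timedelta(hours=max_lag_hours),
    }

# Hypothetical usage: orders_df has an "order_id" key and an "updated_at" timestamp.
# report = basic_checks(orders_df, key="order_id", timestamp="updated_at")
```
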
Discovering, Analyzing, Visualizing and Presenting Data
by EMC Education Services

Data Science and Big Data Analytics is about harnessing the power of data for new insights. The book covers the breadth of activities, methods, and tools that data scientists use. The content focuses on concepts, principles, and practical applications that are applicable to any industry and technology environment, and the learning is supported and explained with examples that you can replicate using open-source software.

This book will help you:

  • Become a contributor on a data science team
  • Deploy a structured lifecycle approach to data analytics problems
  • Apply appropriate analytic techniques and tools to analyzing big data
  • Learn how to tell a compelling story with data to drive business action
  • Prepare for EMC Proven Professional Data Science Certification

Get started discovering, analyzing, visualizing, and presenting data in a meaningful way today!

Five real-world Python projects
by Leonard Apeltsin

Learn data science with Python by building five real-world projects! Experiment with card game predictions, tracking disease outbreaks, and more, as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

  • Techniques for computing and plotting probabilities
  • Statistical analysis using SciPy
  • How to organize datasets with clustering algorithms
  • How to visualize complex multi-variable datasets
  • How to train a decision tree machine learning algorithm

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable data sets and thoroughly-explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.
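
One of the bulleted topics, training a decision tree, can be sketched in a few lines with scikit-learn (shown here on the classic iris dataset; the book's projects use their own data).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.2f}")
```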

What You Need to Know About Data Mining and Data-Analytic Thinking
by Foster Provost and Tom Fawcett

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. This guide also helps you understand the many data-mining techniques in use today.

Based on an MBA course Provost has taught at New York University over the past ten years, Data Science for Business provides examples of real-world business problems to illustrate these principles. You’ll not only learn how to improve communication between business stakeholders and data scientists, but also how to participate intelligently in your company’s data science projects. You’ll also discover how to think data-analytically, and fully appreciate how data science methods can support business decision-making.

  • Understand how data science fits in your organization—and how you can use it for competitive advantage
  • Treat data as a business asset that requires careful investment if you’re to gain real value
  • Approach business problems data-analytically, using the data-mining process to gather good data in the most appropriate way
  • Learn general concepts for actually extracting knowledge from data
  • Apply data science principles when interviewing data science job candidates

First Principles with Python
by Joel Grus

To really learn data science, you should not only master the tools—data science libraries, frameworks, modules, and toolkits—but also understand the ideas and principles underlying them. Updated for Python 3.6, this second edition of Data Science from Scratch shows you how these tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Packed with new material on deep learning, statistics, and natural language processing, this updated book shows you how to find the gems in today’s messy glut of data.

  • Get a crash course in Python
  • Learn the basics of linear algebra, statistics, and probability—and how and when they’re used in data science
  • Collect, explore, clean, munge, and manipulate data
  • Dive into the fundamentals of machine learning
  • Implement models such as k-nearest neighbors, Naïve Bayes, linear and logistic regression, decision trees, neural networks, and clustering
  • Explore recommender systems, natural language processing, network analysis, MapReduce, and databases
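
In the spirit of the book's from-scratch approach, here is a minimal k-nearest-neighbors classifier using only the standard library; it is an illustration, not the author's code.

```python
import math
from collections import Counter

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs."""
    nearest = sorted(labeled_points, key=lambda pl: distance(pl[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [((0, 0), "blue"), ((1, 1), "blue"), ((5, 5), "red"), ((6, 5), "red")]
print(knn_predict(3, points, (4, 4)))  # "red"
```
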
by Jesse C. Daniel

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!
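
A minimal Dask sketch, with hypothetical file and column names, shows the pandas-like API and lazy execution the book builds on.

```python
import dask.dataframe as dd

# A glob of CSV files; Dask reads them lazily as partitions.
df = dd.read_csv("events-2024-*.csv")

# Familiar pandas-style API, but nothing runs until .compute() is called.
counts = df.groupby("event_type")["user_id"].count()
print(counts.compute())
```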

by Stephen A. Thomas

You’ve got data to communicate. But what kind of visualization do you choose, how do you build your visualizations, and how do you ensure that they're up to the demands of the Web?

In Data Visualization with JavaScript, you’ll learn how to use JavaScript, HTML, and CSS to build practical visualizations for your data. Step-by-step examples walk you through creating, integrating, and debugging different types of visualizations and you'll be building basic visualizations (like bar, line, and scatter graphs) in no time.

You'll also learn how to:

  • Create tree maps, heat maps, network graphs, word clouds, and timelines
  • Map geographic data, and build sparklines and composite charts
  • Add interactivity and retrieve data with AJAX
  • Manage data in the browser and build data-driven web applications
  • Harness the power of the Flotr2, Flot, Chronoline.js, D3.js, Underscore.js, and Backbone.js libraries

If you already know your way around building a web page but aren’t quite sure how to build a good visualization, Data Visualization with JavaScript will help you get your feet wet without throwing you into the deep end. You’ll soon be well on your way to creating simple, powerful data visualizations.

by Ashley Davis

Data Wrangling with JavaScript is a hands-on guide that will teach you how to create a JavaScript-based data processing pipeline, handle common and exotic data, and master practical troubleshooting strategies.

A Deep-Dive into How Distributed Data Systems Work
by Alex Petrov

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it’s often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you’ll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You’ll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed.

This book examines:

  • Storage engines: Explore storage classification and taxonomy, and dive into B-Tree-based and immutable Log Structured storage engines, with differences and use-cases for each
  • Storage building blocks: Learn how database files are organized to build efficient storage, using auxiliary data structures such as Page Cache, Buffer Pool and Write-Ahead Log
  • Distributed systems: Learn step-by-step how nodes and processes connect and build complex communication patterns
  • Database clusters: Which consistency models are commonly used by modern databases and how distributed storage systems achieve consistency

Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh
by James Serra

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each.

James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You'll learn what data lakehouses can help you achieve, as well as how to distinguish data mesh hype from reality. Best of all, you'll be able to determine the most appropriate data architecture for your needs. With this book, you'll:

  • Gain a working understanding of several data architectures
  • Learn the strengths and weaknesses of each approach
  • Distinguish data architecture theory from reality
  • Pick the best architecture for your use case
  • Understand the differences between data warehouses and data lakes
  • Learn common data architecture concepts to help you build better solutions
  • Explore the historical evolution and characteristics of data architectures
  • Learn essentials of running an architecture design session, team organization, and project success factors

Free from product discussions, this book will serve as a timeless resource for years to come.

by Stephan Raaijmakers

Explore the most challenging issues of natural language processing, and learn how to solve them with cutting-edge deep learning!

Inside Deep Learning for Natural Language Processing you’ll find a wealth of NLP insights, including:

  • An overview of NLP and deep learning
  • One-hot text representations
  • Word embeddings
  • Models for textual similarity
  • Sequential NLP
  • Semantic role labeling
  • Deep memory-based NLP
  • Linguistic structure
  • Hyperparameters for deep NLP

Deep learning has advanced natural language processing to exciting new levels and powerful new applications! For the first time, computer systems can achieve "human" levels of summarizing, making connections, and other tasks that require comprehension and context.

Deep Learning for Natural Language Processing reveals the groundbreaking techniques that make these innovations possible. Stephan Raaijmakers distills his extensive knowledge into useful best practices, real-world applications, and the inner workings of top NLP algorithms.

Business Intelligence for Microsoft Power BI, SQL Server Analysis Services, and Excel
by Alberto Ferrari and Marco Russo

Now expanded and updated with modern best practices, this is the most complete guide to Microsoft's DAX language for business intelligence, data modeling, and analytics. Expert Microsoft BI consultants Marco Russo and Alberto Ferrari help you master everything from table functions through advanced code and model optimization. You'll learn exactly what happens under the hood when you run a DAX expression, and use this knowledge to write fast, robust code. This edition focuses on examples you can build and run with the free Power BI Desktop, and helps you make the most of the powerful syntax of variables (VAR) in Power BI, Excel, or Analysis Services. Want to leverage all of DAX's remarkable capabilities? This no-compromise "deep dive" is exactly what you need.

Perform powerful data analysis with DAX for Power BI, SQL Server, and Excel

  • Master core DAX concepts, including calculated columns, measures, and calculation groups
  • Work efficiently with basic and advanced table functions
  • Understand evaluation contexts and the CALCULATE and CALCULATETABLE functions
  • Perform time-based calculations
  • Use calculation groups and calculation items
  • Use syntax of variables (VAR) to write more readable, maintainable code
  • Express diverse and unusual relationships with DAX, including many-to-many relationships and bidirectional filters
  • Master advanced optimization techniques, and improve performance in aggregations
  • Optimize data models to achieve better compression
  • Measure DAX query performance with DAX Studio and learn how to optimize your DAX

Modern Data Lakehouse Architectures with Delta Lake
by Bennie Haelen and Dan Davis

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.

You'll learn how to:

  • Use modern data management and data engineering techniques
  • Understand how ACID transactions bring reliability to data lakes at scale
  • Run streaming and batch jobs against your data lake concurrently
  • Execute update, delete, and merge commands against your data lake
  • Use time travel to roll back and examine previous data versions
  • Build a streaming data quality pipeline following the medallion architecture
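
As a small illustration of the time travel feature mentioned above, the following sketch uses the delta-spark Python package with an illustrative local path; configuration details vary by environment, and the snippet is not taken from the book.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write two versions of a small Delta table, then read the first one back.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo_delta")
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/demo_delta")

v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
print(v0.count())  # 5: the table as of version 0
```
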
by Nicolas Vandeput

Lead your demand planning process to excellence and deliver real value to your supply chain.

In Demand Forecasting Best Practices you’ll learn how to:

  • Lead your team to improve quality while reducing workload
  • Properly define the objectives and granularity of your demand planning
  • Use intelligent KPIs to track accuracy and bias
  • Identify areas for process improvement
  • Help planners and stakeholders add value
  • Determine relevant data to collect and how best to collect it
  • Utilize different statistical and machine learning models

An expert demand forecaster can help an organization avoid overproduction, reduce waste, and optimize inventory levels for a real competitive advantage. Demand Forecasting Best Practices teaches you how to become that virtuoso demand forecaster.

This one-of-a-kind guide reveals forecasting tools, metrics, models, and stakeholder management techniques for delivering more effective supply chains. Everything you learn has been proven and tested in a live business environment. Discover author Nicolas Vandeput’s original five-step framework for demand planning excellence and learn how to tailor it to your own company’s needs. Illustrations and real-world examples make each concept easy to understand and easy to follow. You’ll soon be delivering accurate predictions that are driving major business value.

by Danil Zburivsky and Lynda Partner

Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as yet unimagined data services.

Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you'll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You'll also explore setting up processes to manage cloud-based data, keep it secure, and using advanced analytic and BI tools to analyze it.

Patterns and Paradigms for Scalable, Reliable Services
by Brendan Burns

Without established design patterns to guide them, developers have had to build distributed systems from scratch, and most of these systems end up as unique, one-off designs. Today, the increasing use of containers has paved the way for core distributed system patterns and reusable containerized components. This practical guide presents a collection of repeatable, generic patterns to help make the development of reliable distributed systems far more approachable and efficient.

Author Brendan Burns—Director of Engineering at Microsoft Azure—demonstrates how you can adapt existing software design patterns for designing and building reliable distributed applications. Systems engineers and application developers will learn how these long-established patterns provide a common language and framework for dramatically increasing the quality of your system.

  • Understand how patterns and reusable components enable the rapid development of reliable distributed systems
  • Use the side-car, adapter, and ambassador patterns to split your application into a group of containers on a single machine
  • Explore loosely coupled multi-node distributed patterns for replication, scaling, and communication between the components
  • Learn distributed system patterns for large-scale batch data processing covering work-queues, event-based processing, and coordinated workflows
Use Python to Tackle Your Toughest Business Challenges
by Bradford Tuckfield

Dive into the exciting world of data science with this practical introduction. Packed with essential skills and useful examples, Dive Into Data Science will show you how to obtain, analyze, and visualize data so you can leverage its power to solve common business challenges.

With only a basic understanding of Python and high school math, you’ll be able to effortlessly work through the book and start implementing data science in your day-to-day work. From improving a bike sharing company to extracting data from websites and creating recommendation systems, you’ll discover how to find and use data-driven solutions to make business decisions.

Topics covered include conducting exploratory data analysis, running A/B tests, performing binary classification using logistic regression models, and using machine learning algorithms.

You’ll also learn how to:

  • Forecast consumer demand
  • Optimize marketing campaigns
  • Reduce customer attrition
  • Predict website traffic
  • Build recommendation systems

With this practical guide at your fingertips, harness the power of programming, mathematical theory, and good old common sense to find data-driven solutions that make a difference. Don’t wait; dive right in!
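As a small taste of the binary-classification topic mentioned above, here is a hedged scikit-learn sketch on synthetic data; the features, labels, and effect sizes are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Tiny synthetic example: predict churn (1) vs. stay (0) from two features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))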

Use Programming to Explore Algebra, Statistics, Calculus, and More!
by Amit Saha

Doing Math with Python shows you how to use Python to delve into high school–level math topics like statistics, geometry, probability, and calculus. You’ll start with simple projects, like a factoring program and a quadratic-equation solver, and then create more complex projects once you’ve gotten the hang of things.

Along the way, you’ll discover new ways to explore math and gain valuable programming skills that you’ll use throughout your study of math and computer science. Learn how to:

  • Describe your data with statistics, and visualize it with line graphs, bar charts, and scatter plots
  • Explore set theory and probability with programs for coin flips, dicing, and other games of chance
  • Solve algebra problems using Python’s symbolic math functions
  • Draw geometric shapes and explore fractals like the Barnsley fern, the Sierpinski triangle, and the Mandelbrot set
  • Write programs to find derivatives and integrate functions

Creative coding challenges and applied examples help you see how you can put your new math and coding skills into practice. You’ll write an inequality solver, plot gravity’s effect on how far a bullet will travel, shuffle a deck of cards, estimate the area of a circle by throwing 100,000 “darts” at a board, explore the relationship between the Fibonacci sequence and the golden ratio, and more.

Whether you’re interested in math but have yet to dip into programming or you’re a teacher looking to bring programming into the classroom, you’ll find that Python makes programming easy and practical. Let Python handle the grunt work while you focus on the math.

Uses Python 3
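For a sense of the symbolic math the book builds toward, a minimal SymPy sketch; the expressions are arbitrary examples, not the book's exercises:

    from sympy import symbols, factor, solve, diff, integrate

    x = symbols("x")

    print(factor(x**2 + 5*x + 6))    # (x + 2)*(x + 3)
    print(solve(x**2 + 5*x + 6, x))  # roots of the quadratic: [-3, -2]
    print(diff(x**3, x))             # derivative: 3*x**2
    print(integrate(3*x**2, x))      # antiderivative: x**3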

Take Control of Your Data with Fundamental Linear Algebra, Probability, and Statistics
by Thomas Nield

Master the math needed to excel in data science, machine learning, and statistics. In this book author Thomas Nield guides you through areas like calculus, probability, linear algebra, and statistics and how they apply to techniques like linear regression, logistic regression, and neural networks. Along the way you'll also gain practical insights into the state of data science and how to use those insights to maximize your career.

Learn how to:

  • Use Python code and libraries like SymPy, NumPy, and scikit-learn to explore essential mathematical concepts like calculus, linear algebra, statistics, and machine learning
  • Understand techniques like linear regression, logistic regression, and neural networks in plain English, with minimal mathematical notation and jargon
  • Perform descriptive statistics and hypothesis testing on a dataset to interpret p-values and statistical significance
  • Manipulate vectors and matrices and perform matrix decomposition
  • Integrate and build upon incremental knowledge of calculus, probability, statistics, and linear algebra, and apply it to regression models including neural networks
  • Navigate practically through a data science career and avoid common pitfalls, assumptions, and biases while tuning your skill set to stand out in the job market
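As one small illustration of the hypothesis-testing material listed above, a sketch using NumPy and SciPy on synthetic data; the group names and effect size are invented:

    import numpy as np
    from scipy import stats

    # Hypothetical A/B-style comparison: do two samples share the same mean?
    rng = np.random.default_rng(42)
    control = rng.normal(loc=10.0, scale=2.0, size=50)
    variant = rng.normal(loc=11.0, scale=2.0, size=50)

    t_stat, p_value = stats.ttest_ind(control, variant)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests a real difference
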
The science and strategy of customer retention
by Carl S. Gold

The beating heart of any product or service business is returning clients. Don't let your hard-won customers vanish, taking their money with them. In Fighting Churn with Data you'll learn powerful data-driven techniques to maximize customer retention and minimize actions that cause them to stop engaging or unsubscribe altogether. This hands-on guide is packed with techniques for converting raw data into measurable metrics, testing hypotheses, and presenting findings that are easily understandable to non-technical decision makers.

Plan and Build Robust Data Systems
by Joe Reis and Matt Housley

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Get a concise overview of the entire data engineering landscape
  • Assess data engineering problems using an end-to-end framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle
Implement Trustworthy End-to-End Data Solutions
by Andy Petrella

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that enables data teams to gain greater visibility of data and its usage. If you're a data engineer, data architect, or machine learning engineer who depends on the quality of your data, this book shows you how to focus on the practical aspects of introducing data observability in your everyday work.

Author Andy Petrella helps you build the right habits to identify and solve data issues, such as data drifts and poor quality, so you can stop their propagation in data applications, pipelines, and analytics. You'll learn ways to introduce data observability, including setting up a framework for generating and collecting all the information you need.

  • Learn the core principles and benefits of data observability
  • Use data observability to detect, troubleshoot, and prevent data issues
  • Follow the book's recipes to implement observability in your data projects
  • Use data observability to create a trustworthy communication framework with data consumers
  • Learn how to educate your peers about the benefits of data observability
A Primer on Making Informative and Compelling Figures
by Claus O. Wilke

Effective visualization is the best way to communicate information from the increasingly large and complex datasets in the natural and social sciences. But with the increasing power of visualization software today, scientists, engineers, and business analysts often have to navigate a bewildering array of visualization choices and options.

This practical book takes you through many commonly encountered visualization problems, and it provides guidelines on how to turn large datasets into clear and compelling figures. What visualization type is best for the story you want to tell? How do you make informative figures that are visually pleasing? Author Claus O. Wilke teaches you the elements most critical to successful data visualization.

  • Explore the basic concepts of color as a tool to highlight, distinguish, or represent a value
  • Understand the importance of redundant coding to ensure you provide key information in multiple ways
  • Use the book’s visualizations directory, a graphical guide to commonly used types of data visualizations
  • Get extensive examples of good and bad figures
  • Learn how to use figures in a document or report and how to employ them effectively to tell a compelling story
by Mark L. Gillenson

In the newly revised third edition of Fundamentals of Database Management Systems, veteran database expert Dr. Mark Gillenson delivers an authoritative and comprehensive account of contemporary database management. The Third Edition assists readers in understanding critical topics in the subject, including data modeling, relational database concepts, logical and physical database design, SQL, data administration, data security, NoSQL, blockchain, database in the cloud, and more.

The author offers a firm grounding in the fundamentals of databases while, at the same time, providing a wide-ranging survey of database subfields relevant to information systems professionals. The supplements now also include the author's audio narration of the accompanying PowerPoint slides. Readers will also find:

  • Brand-new content on NoSQL database management, NewSQL, blockchain, and database-intensive applications, including data analytics, ERP, CRM, and SCM
  • Updated and revised narrative material designed to offer a friendly introduction to database management
  • Renewed coverage of cloud-based database management
  • Extensive updates to incorporate the transition from rotating disk secondary storage to solid state drives
by Chris Garrard

Geoprocessing with Python teaches you how to use the Python programming language, along with free and open source tools, to read, write, and process geospatial data.

by Ekaterina Kochmar

Hit the ground running with this in-depth introduction to the NLP skills and techniques that allow your computers to speak human.

In Getting Started with Natural Language Processing you’ll learn about:

  • Fundamental concepts and algorithms of NLP
  • Useful Python libraries for NLP
  • Building a search algorithm
  • Extracting information from raw text
  • Predicting sentiment of an input text
  • Author profiling
  • Topic labeling
  • Named entity recognition

Getting Started with Natural Language Processing is an enjoyable and understandable guide that helps you engineer your first NLP algorithms. Your tutor is Dr. Ekaterina Kochmar, lecturer at the University of Bath, who has helped thousands of students take their first steps with NLP. Full of Python code and hands-on projects, each chapter provides a concrete example with practical techniques that you can put into practice right away. If you’re a beginner to NLP and want to upgrade your applications with functions and features like information extraction, user profiling, and automatic topic labeling, this is the book for you.

Understanding data with graphs
by Philipp K. Janert

Gnuplot in Action, Second Edition is a major revision of this popular and authoritative guide for developers, engineers, and scientists who want to learn and use gnuplot effectively. Fully updated for gnuplot version 5, the book includes four pages of color illustrations and four bonus appendixes available in the eBook.

With examples in Neo4j
by Tomaž Bratanič

Practical methods for analyzing your data with graphs, revealing hidden connections and new insights.

Graphs are the natural way to represent and understand connected data. This book explores the most important algorithms and techniques for graphs in data science, with concrete advice on implementation and deployment. You don’t need any graph experience to start benefiting from this insightful guide. These powerful graph algorithms are explained in clear, jargon-free text and illustrations that make them easy to apply to your own projects.

In Graph Algorithms for Data Science you will learn:

  • Labeled-property graph modeling
  • Constructing a graph from structured data such as CSV or SQL
  • NLP techniques to construct a graph from unstructured data
  • Cypher query language syntax to manipulate data and extract insights
  • Social network analysis algorithms like PageRank and community detection
  • How to translate graph structure to a ML model input with node embedding models
  • Using graph features in node classification and link prediction workflows

Graph Algorithms for Data Science is a hands-on guide to working with graph-based data in applications like machine learning, fraud detection, and business data analysis. It’s filled with fascinating and fun projects, demonstrating the ins-and-outs of graphs. You’ll gain practical skills by analyzing Twitter, building graphs with NLP techniques, and much more.
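The book works primarily in Cypher against Neo4j; as a rough sketch of what that looks like from Python, here is a minimal query through the official neo4j driver (the connection details, labels, and relationship types are hypothetical):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (u:User)-[:FOLLOWS]->(v:User)
    RETURN v.name AS name, count(u) AS followers
    ORDER BY followers DESC
    LIMIT 5
    """

    with driver.session() as session:
        for record in session.run(query):  # simple in-degree as a popularity proxy
            print(record["name"], record["followers"])

    driver.close()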

Examples in Gremlin
by Dave Bechberger and Josh Perryman

Relationships in data often look far more like a web than an orderly set of rows and columns. Graph databases shine when it comes to revealing valuable insights within complex, interconnected data such as demographics, financial records, or computer networks.

In Graph Databases in Action, experts Dave Bechberger and Josh Perryman illuminate the design and implementation of graph databases in real-world applications. You'll learn how to choose the right database solutions for your tasks, and how to use your new knowledge to build agile, flexible, and high-performing graph-powered applications!

by Chuck Lam

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

by Alex Holmes

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.

Storage and Analysis at Internet Scale
by Tom White

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

  • Learn fundamental components such as MapReduce, HDFS, and YARN
  • Explore MapReduce in depth, including steps for developing applications with it
  • Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
  • Learn two data formats: Avro for data serialization and Parquet for nested data
  • Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
  • Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
  • Learn the HBase distributed database and the ZooKeeper distributed configuration service
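Hadoop's native MapReduce API is Java, but the Hadoop Streaming interface accepts any program that reads stdin and writes stdout; a word-count mapper and reducer sketched in Python for flavor (the file names and job-submission details are omitted and purely illustrative):

    # --- mapper.py: emit "word<TAB>1" for every word on stdin ---
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # --- reducer.py: sum counts per word (input arrives sorted by key) ---
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
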
Learn how to effectively prepare data for successful data analytics
by Roy Jafari

Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who's developed college-level courses on data preprocessing and related subjects.

With this book, you'll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.

You'll learn about different technical and analytical aspects of data preprocessing (data collection, data cleaning, data integration, data reduction, and data transformation) and get to grips with implementing them using the open source Python programming environment.

The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you'll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data.

By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.

What you will learn

  • Use Python to perform analytics functions on your data
  • Understand the role of databases and how to effectively pull data from databases
  • Perform data preprocessing steps defined by your analytics goals
  • Recognize and resolve data integration challenges
  • Identify the need for data reduction and execute it
  • Detect opportunities to improve analytics with data transformation

Who this book is for

This book is for junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data. You don't need any prior experience with data preprocessing to get started with this book. However, basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are a prerequisite.
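A small pandas sketch of the cleaning and transformation steps described above; the file name and columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("survey_raw.csv")  # hypothetical raw data

    # Missing values: fill numeric gaps with the median, drop rows missing the key field
    df["income"] = df["income"].fillna(df["income"].median())
    df = df.dropna(subset=["respondent_id"])

    # Outliers: clip income to the 1st-99th percentile range
    low, high = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(low, high)

    # Transformation: standardize a numeric column as z-scores
    df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()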

by Nicholas Dimiduk and Amandeep Khurana

HBase in Action has all the knowledge you need to design, build, and run applications using HBase. First, it introduces you to the fundamentals of distributed systems and large scale data handling. Then, you'll explore real-world applications and code samples with just enough theory to understand the practical techniques. You'll see how to build applications with HBase and take advantage of the MapReduce processing framework. And along the way you'll learn patterns and best practices.

An Introduction to Designing with D3
by Scott Murray

Create and publish your own interactive data visualization projects on the web, even if you have little or no experience with data visualization or web development. It's inspiring and fun with this friendly, accessible, and practical hands-on introduction. This fully updated and expanded second edition takes you through the fundamental concepts and methods of D3, the most powerful JavaScript library for expressing data visually in a web browser.

Ideal for designers with no coding experience, reporters exploring data journalism, and anyone who wants to visualize and share data, this step-by-step guide will also help you expand your web programming skills by teaching you the basics of HTML, CSS, JavaScript, and SVG.

  • Learn D3 with downloadable code and over 140 examples
  • Create bar charts, scatter plots, pie charts, stacked bar charts, and force-directed graphs
  • Use smooth, animated transitions to show changes in your data
  • Introduce interactivity to help users explore your data
  • Create custom geographic maps with panning, zooming, labels, and tooltips
  • Walk through the creation of a complete visualization project, from start to finish
  • Explore inspiring case studies with nine accomplished designers talking about their D3-based projects
Big data, machine learning, and more, using Python tools
by Davy Cielen, Arno D. B. Meysman and Mohamed Ali

Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science.

Automating SQL server tasks with PowerShell commands
by Chrissy LeMaire, Rob Sewell, Jess Pomfret and Cláudio Silva

If you work with SQL Server, dbatools is a lifesaver. This book will show you how to use this free and open source PowerShell module to automate just about every SQL server task you can imagine—all in just one month!

In Learn dbatools in a Month of Lunches you will learn how to:

  • Perform instance-to-instance and customized migrations
  • Automate security audits, tempdb configuration, alerting, and reporting
  • Schedule and monitor PowerShell tasks in SQL Server Agent
  • Bulk-import any type of data into SQL Server
  • Install dbatools in secure environments

Written by a group of expert authors including dbatools creator Chrissy LeMaire, Learn dbatools in a Month of Lunches teaches you techniques that will make you more effective—and efficient—than you ever thought possible. In twenty-eight lunchbreak lessons, you’ll learn the most important use cases of dbatools and the favorite functions of its core developers. Stabilize and standardize your SQL server environment, and simplify your tasks by building automation, alerting, and reporting with this powerful tool.

Use, manage, and build secure and scalable databases with PostgreSQL 16
by Luca Ferrari and Enrico Pirozzi

The latest edition of this PostgreSQL book will help you to start using PostgreSQL from absolute scratch, helping you to quickly understand the internal workings of the database. With a structured approach and practical examples, go on a journey that covers the basics, from SQL statements and how to run server-side programs, to configuring, managing, securing, and optimizing database performance.

This new edition will not only help you get to grips with all the recent changes within the PostgreSQL ecosystem but will also dig deeper into concepts like partitioning and replication with a fresh set of examples. The book also provides Docker images for each chapter, which make the learning experience faster and easier. Starting with the absolute basics of databases, the book sails through to advanced concepts like window functions, logging, auditing, extending the database, configuration, partitioning, and replication. It will also help you seamlessly migrate your existing database system to PostgreSQL and contains a dedicated chapter on disaster recovery. Each chapter ends with practice questions to test your learning at regular intervals.

By the end of this book, you will be able to install, configure, manage, and develop applications against a PostgreSQL database.

What you will learn

  • Gain a deeper understanding of PostgreSQL internals like transactions, MVCC, security and replication
  • Enhance data management with PostgreSQL’s latest partitioning features
  • Choose the right replication strategy for your database
  • See concrete examples of how to migrate data from another database, perform backups and restores, monitor your PostgreSQL installation and more
  • Ensure security and compliance with schemas and user privileges
  • Create customized database functions and extensions
  • Get to grips with server-side programming, window functions, and triggers

Who this book is for

Learning PostgreSQL 16 is for anyone interested in learning about the PostgreSQL database from scratch. Anyone looking to build robust data warehousing applications and scale the database for high-availability and performance using the latest features of PostgreSQL will also find this book useful. Although prior knowledge of PostgreSQL is not required, familiarity with databases is expected.

A Deceptively Simple Introduction to the Terrifyingly Beautiful World of Computers and Data Science
by Zed A. Shaw

Zed Shaw has created the world's most reliable system for learning Python. Follow it and you will succeed--just like the millions of beginners Zed has taught to date! You bring the discipline, persistence, and attention; the author supplies the masterful knowledge you need to succeed.

In Learn Python the Hard Way, Fifth Edition, you'll learn Python by working through 60 lovingly crafted exercises. Read them. Type in the code. Run it. Fix your mistakes. Repeat. As you do, you'll learn how a computer works, how to solve problems, and how to enjoy programming . . . even when it's driving you crazy.

  • Install a complete Python environment
  • Organize and write code
  • Fix and break code
  • Basic mathematics
  • Strings and text
  • Interact with users
  • Work with files
  • Looping and logic
  • Object-oriented programming
  • Data structures using lists and dictionaries
  • Modules, classes, and objects
  • Python packaging
  • Automated testing
  • Basic SQL for Data Science
  • Web scraping
  • Fixing bad data (munging)
  • The "Data" part of "Data Science"

It'll be frustrating at first. But if you keep trying, you'll get it--and it'll feel amazing! This course will reward you for every minute you put into it. Soon, you'll know one of the world's most powerful, popular programming languages. You'll be a Python programmer.

This Book Is Perfect For

  • Total beginners with zero programming experience
  • Junior developers who know one or two languages
  • Returning professionals who haven't written code in years
  • Aspiring Data Scientists or academics who need to learn to code
  • Seasoned professionals looking for a fast, simple crash course in Python for Data Science

Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.

by Don Jones

Learn SQL Server Administration in a Month of Lunches is the perfect way to get started with SQL Server operations, including maintenance, backup and recovery, high availability, and performance monitoring. In about an hour a day over a month, you'll learn exactly what you can do, and what you shouldn't touch. Most importantly, you'll learn the day-to-day tasks and techniques you need to keep SQL Server humming along smoothly.

Improving Productivity for Business Processes and Workflows
by Paul Papanek Stork

Processing information efficiently is critical to the successful operation of modern organizations. One particularly helpful tool is Microsoft Power Automate, a low-code/no-code development platform designed to help tech-savvy users create and implement workflows. This practical book explains how small-business and enterprise users can replace manual work that takes days with an automated process you can set up in a few hours using Power Automate.

Paul Papanek Stork, principal architect at Don't Pa..Panic Consulting, provides a concise yet comprehensive overview of the foundational skills required to understand and work with Power Automate. You'll learn how to use these workflows, or flows, to automate repetitive tasks or complete business processes without manual intervention.

Whether you're transferring form responses to a list, managing document approvals, sending automatic reminders for overdue tasks, or archiving emails and attachments, these skills will help you:

  • Design and build flows with templates or from scratch
  • Select triggers and actions to automate a process
  • Add actions to a flow to retrieve and process information
  • Use functions to transform information
  • Control the logic of a process using conditional actions, loops, or parallel branches
  • Implement error checking to avoid potential problems
Transforming Data into Insights
by Jeremey Arnold

Microsoft Power BI is a data analytics and visualization tool powerful enough for the most demanding data scientists, but accessible enough for everyday use for anyone who needs to get more from data. The market has many books designed to train and equip professional data analysts to use Power BI, but few of them make this tool accessible to anyone who wants to get up to speed on their own.

This streamlined intro to Power BI covers all the foundational aspects and features you need to go from "zero to hero" with data and visualizations. Whether you work with large, complex datasets or work in Microsoft Excel, author Jeremey Arnold shows you how to teach yourself Power BI and use it confidently as a regular data analysis and reporting tool.

You'll learn how to:

  • Import, manipulate, visualize, and investigate data in Power BI
  • Approach solutions for both self-service and enterprise BI
  • Use Power BI in your organization's business intelligence strategy
  • Produce effective reports and dashboards
  • Create environments for sharing reports and managing data access with your team
  • Determine the right solution for using Power BI offerings based on size, security, and computational needs
Lightning-Fast Data Analytics
by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

  • Learn Python, SQL, Scala, or Java high-level Structured APIs
  • Understand Spark operations and SQL Engine
  • Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow
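A minimal PySpark DataFrame and SQL sketch in the spirit of the bullets above; the JSON source and column names are invented for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

    events = spark.read.json("s3://example-bucket/events/*.json")  # hypothetical source

    daily = (events
             .withColumn("day", F.to_date("timestamp"))
             .groupBy("day", "event_type")
             .agg(F.count("*").alias("events"))
             .orderBy("day"))
    daily.show()

    # The same aggregation through the SQL engine
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, count(*) AS events FROM events GROUP BY event_type").show()
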
Finding Stories in Internet Data
by Lam Thuy Vo

Did fake Twitter accounts help sway a presidential election? What can Facebook and Reddit archives tell us about human behavior? In Mining Social Media, senior BuzzFeed reporter Lam Thuy Vo shows you how to use Python and key data analysis tools to find the stories buried in social media.

Whether you’re a professional journalist, an academic researcher, or a citizen investigator, you’ll learn how to use technical tools to collect and analyze data from social media sources to build compelling, data-driven stories.

Learn how to:

  • Write Python scripts and use APIs to gather data from the social web
  • Download data archives and dig through them for insights
  • Inspect HTML downloaded from websites for useful content
  • Format, aggregate, sort, and filter your collected data using Google Sheets
  • Create data visualizations to illustrate your discoveries
  • Perform advanced data analysis using Python, Jupyter Notebooks, and the pandas library
  • Apply what you’ve learned to research topics on your own

Social media is filled with thousands of hidden stories just waiting to be told. Learn to use the data-sleuthing tools that professionals use to write your own data-driven stories.
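To hint at what the archive-digging step can look like, a small pandas sketch; the export file and its columns are hypothetical:

    import pandas as pd

    # Hypothetical newline-delimited JSON export of posts
    posts = pd.read_json("posts_archive.jsonl", lines=True)
    posts["created"] = pd.to_datetime(posts["created_utc"], unit="s")

    # Who posts most, and when is activity highest?
    print(posts["author"].value_counts().head(10))
    print(posts.set_index("created").resample("W").size())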

An Introduction for Scientists and Engineers
by Allen B. Downey

Modeling and Simulation in Python is a thorough but easy-to-follow introduction to physical modeling—that is, the art of describing and simulating real-world systems.

Readers are guided through modeling things like world population growth, infectious disease, bungee jumping, baseball flight trajectories, celestial mechanics, and more while simultaneously developing a strong understanding of fundamental programming concepts like loops, vectors, and functions.

Clear and concise, with a focus on learning by doing, the author spares the reader abstract, theoretical complexities and gets right to hands-on examples that show how to produce useful models and simulations.
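In the spirit of the book's population models, a minimal loop-based sketch of logistic growth; the starting values are illustrative, not the book's:

    # Euler-style update of a simple logistic population model
    population = 8.2          # billions, hypothetical starting value
    growth_rate = 0.01        # per year
    carrying_capacity = 11.0  # billions

    trajectory = []
    for year in range(2025, 2101):
        trajectory.append((year, population))
        population += growth_rate * population * (1 - population / carrying_capacity)

    print(trajectory[-1])     # projected (year, population) at the end of the run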

Explore industry-ready time series forecasting using modern machine learning and deep learning
by Manu Joseph

We live in a serendipitous era in which the explosion in the quantity of data collected and a renewed interest in data-driven techniques such as machine learning (ML) have changed the landscape of analytics and, with it, time series forecasting. This book, filled with industry-tested tips and tricks, takes you beyond commonly used classical statistical methods such as ARIMA and introduces you to the latest techniques from the world of ML.

This is a comprehensive guide to analyzing, visualizing, and creating state-of-the-art forecasting systems, complete with common topics such as ML and deep learning (DL) as well as rarely touched-upon topics such as global forecasting models, cross-validation strategies, and forecast metrics. You’ll begin by exploring the basics of data handling, data visualization, and classical statistical methods before moving on to ML and DL models for time series forecasting. This book takes you on a hands-on journey in which you’ll develop state-of-the-art ML (linear regression to gradient-boosted trees) and DL (feed-forward neural networks, LSTMs, and transformers) models on a real-world dataset along with exploring practical topics such as interpretability.

By the end of this book, you’ll be able to build world-class time series forecasting systems and tackle problems in the real world.

What you will learn

  • Find out how to manipulate and visualize time series data like a pro
  • Set strong baselines with popular models such as ARIMA
  • Discover how time series forecasting can be cast as regression
  • Engineer features for machine learning models for forecasting
  • Explore the exciting world of ensembling and stacking models
  • Get to grips with the global forecasting paradigm
  • Understand and apply state-of-the-art DL models such as N-BEATS and Autoformer
  • Explore multi-step forecasting and cross-validation strategies

Who this book is for

The book is for data scientists, data analysts, machine learning engineers, and Python developers who want to build industry-ready time series models. Since the book explains most concepts from the ground up, basic proficiency in Python is all you need. Prior understanding of machine learning or forecasting will help speed up your learning. For experienced machine learning and forecasting practitioners, this book has a lot to offer in terms of advanced techniques and traversing the latest research frontiers in time series forecasting.
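One of the book's central ideas, casting forecasting as regression over lag features, can be sketched with pandas and scikit-learn on synthetic data; the series and model settings here are purely illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # Synthetic monthly series with trend plus noise
    rng = np.random.default_rng(0)
    y = pd.Series(100 + 0.5 * np.arange(120) + rng.normal(scale=2, size=120))

    # Forecasting as regression: predict y[t] from its last 12 lags
    frame = pd.DataFrame({f"lag_{k}": y.shift(k) for k in range(1, 13)})
    frame["target"] = y
    frame = frame.dropna()

    train, test = frame.iloc[:-12], frame.iloc[-12:]
    model = GradientBoostingRegressor().fit(train.drop(columns="target"), train["target"])
    preds = model.predict(test.drop(columns="target"))
    print("MAE:", np.mean(np.abs(preds - test["target"].values)))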

Covers MongoDB version 3.0
by Kyle Banker, Peter Bakkum, Shaun Verch, Douglas Garrett and Tim Hawkins

MongoDB in Action, Second Edition is a completely revised and updated version. It introduces MongoDB 3.0 and the document-oriented database model. This perfectly paced book gives you both the big picture you'll need as a developer and enough low-level detail to satisfy system engineers.

Powerful and Scalable Data Storage
by Shannon Bradshaw, Eoin Brazil and Kristina Chodorow

Manage your data with a system designed to support modern application development. Updated for MongoDB 4.2, the third edition of this authoritative and accessible guide shows you the advantages of using document-oriented databases. You’ll learn how this secure, high-performance system enables flexible data models, high availability, and horizontal scalability.

Authors Shannon Bradshaw, Eoin Brazil, and Kristina Chodorow provide guidance for database developers, advanced configuration for system administrators, and use cases for a variety of projects. NoSQL newcomers and experienced MongoDB users will find updates on querying, indexing, aggregation, transactions, replica sets, ops management, sharding and data administration, durability, monitoring, and security.

In six parts, this book shows you how to:

  • Work with MongoDB, perform write operations, find documents, and create complex queries
  • Index collections, aggregate data, and use transactions for your application
  • Configure a local replica set and learn how replication interacts with your application
  • Set up cluster components and choose a shard key for a variety of applications
  • Explore aspects of application administration and configure authentication and authorization
  • Use stats when monitoring, back up and restore deployments, and use system settings when deploying MongoDB
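For readers new to the document model, a minimal PyMongo sketch of writes, queries, and an aggregation; the connection string, database, and fields are hypothetical:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
    orders = client.shop.orders

    orders.insert_one({"customer": "ada", "total": 42.50, "items": ["book", "pen"]})

    for doc in orders.find({"total": {"$gt": 20}}):
        print(doc["customer"], doc["total"])

    # Aggregate total spend per customer
    pipeline = [{"$group": {"_id": "$customer", "spend": {"$sum": "$total"}}}]
    for row in orders.aggregate(pipeline):
        print(row)
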
Learn to build apps that can understand people
by George-Bogdan Ivanov

Natural Language Processing (NLP) is a collection of techniques to analyze, interpret, and create human-understandable text and speech. Advances in machine learning have pushed NLP to new levels of accuracy and uncanny realism.

Natural Language Processing for Hackers lays out everything you need to crawl, clean, build, fine-tune, and deploy natural language models from scratch—all with easy-to-read Python code.

A Practical Introduction
by Yuli Vasiliev

Natural Language Processing with Python and spaCy will show you how to create NLP applications like chatbots, text-condensing scripts, and order-processing tools quickly and easily. You’ll learn how to leverage the spaCy library to extract meaning from text intelligently; how to determine the relationships between words in a sentence (syntactic dependency parsing); identify nouns, verbs, and other parts of speech (part-of-speech tagging); and sort proper nouns into categories like people, organizations, and locations (named entity recognition). You’ll even learn how to transform statements into questions to keep a conversation going.

You’ll also learn how to:

  • Work with word vectors to mathematically find words with similar meanings (Chapter 5)
  • Identify patterns within data using spaCy's built-in displaCy visualizer (Chapter 7)
  • Automatically extract keywords from user input and store them in a relational database (Chapter 9)
  • Deploy a chatbot app to interact with users over the internet (Chapter 11)

“Try This” sections in each chapter encourage you to practice what you’ve learned by expanding the book’s example scripts to handle a wider range of inputs, add error handling, and build professional-quality applications.

By the end of the book, you’ll be creating your own NLP applications with Python and spaCy.
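A minimal spaCy sketch of the parsing and entity-recognition features described above; it assumes the small English model (en_core_web_sm) is installed, and the sentence is arbitrary:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ada Lovelace wrote the first program in London.")

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)  # POS tags and dependency parse

    for ent in doc.ents:
        print(ent.text, ent.label_)  # named entities, e.g. PERSON, GPE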

Building Language Applications with Hugging Face
by Lewis Tunstall, Leandro von Werra and Thomas Wolf

Since their introduction in 2017, transformers have quickly become the dominant architecture for achieving state-of-the-art results on a variety of natural language processing tasks. If you're a data scientist or coder, this practical book, now revised in full color, shows you how to train and scale these large models using Hugging Face Transformers, a Python-based deep learning library.

Transformers have been used to write realistic news stories, improve Google Search queries, and even create chatbots that tell corny jokes. In this guide, authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf, among the creators of Hugging Face Transformers, use a hands-on approach to teach you how transformers work and how to integrate them in your applications. You'll quickly learn a variety of tasks they can help you solve.

  • Build, debug, and optimize transformer models for core NLP tasks, such as text classification, named entity recognition, and question answering
  • Learn how transformers can be used for cross-lingual transfer learning
  • Apply transformers in real-world scenarios where labeled data is scarce
  • Make transformer models efficient for deployment using techniques such as distillation, pruning, and quantization
  • Train transformers from scratch and learn how to scale to multiple GPUs and distributed environments
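To show how little code a first experiment takes, a hedged sketch with the Hugging Face pipeline API; it downloads default pretrained checkpoints on first use, and the example texts are invented:

    from transformers import pipeline

    classifier = pipeline("text-classification")
    print(classifier("This book made transformers finally click for me."))

    qa = pipeline("question-answering")
    print(qa(question="What library does the book teach?",
             context="The book teaches the Hugging Face Transformers library."))
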
by Dan Sullivan

The Google Cloud Certified Professional Data Engineer Study Guide provides everything you need to prepare for this important exam and master the skills necessary to land that coveted Google Cloud Professional Data Engineer certification. It begins with a pre-book assessment quiz to evaluate what you already know; each chapter then features exam objectives and review questions, and the online learning environment includes additional complete practice tests.

Written by Dan Sullivan, a popular and experienced online course author for machine learning, big data, and Cloud topics, Google Cloud Certified Professional Data Engineer Study Guide is your ace in the hole for deploying and managing analytics and machine learning applications.

  • Build and operationalize storage systems, pipelines, and compute infrastructure
  • Understand machine learning models and learn how to select pre-built models
  • Monitor and troubleshoot machine learning models
  • Design analytics and machine learning applications that are secure, scalable, and highly available.

This exam guide is designed to help you develop an in-depth understanding of data engineering and machine learning on Google Cloud Platform.

by Daniel Y. Chen

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple data sets.

Pandas for Everyone, 2nd Edition, brings together practical knowledge and insight for solving real problems with Pandas, even if you're new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world data science problems such as using regularization to prevent data overfitting, or when to use unsupervised machine learning methods to find the underlying structure in a data set.

New features to the second edition include:

  • Extended coverage of plotting and the seaborn data visualization library
  • Expanded examples and resources
  • Updated Python 3.9 code and packages coverage, including statsmodels and scikit-learn libraries
  • Online bonus material on geopandas, Dask, and creating interactive graphics with Altair

Chen gives you a jumpstart on using Pandas with a realistic data set and covers combining data sets, handling missing data, and structuring data sets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability and introduces you to the wider Python data analysis ecosystem.

  • Work with DataFrames and Series, and import or export data
  • Create plots with matplotlib, seaborn, and pandas
  • Combine data sets and handle missing data
  • Reshape, tidy, and clean data sets so they're easier to work with
  • Convert data types and manipulate text strings
  • Apply functions to scale data manipulations
  • Aggregate, transform, and filter large data sets with groupby
  • Leverage Pandas' advanced date and time capabilities
  • Fit linear models using statsmodels and scikit-learn libraries
  • Use generalized linear modeling to fit models with different response variables
  • Compare multiple models to select the best one
  • Regularize to overcome overfitting and improve performance
  • Use clustering in unsupervised machine learning
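A tiny pandas sketch of the reshaping and groupby work covered above; the table and column names are invented:

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["north", "north", "south", "south"],
        "q1": [100, 150, 90, 120],
        "q2": [110, 160, 95, 130],
    })

    # Reshape wide quarterly columns into tidy rows, then aggregate with groupby
    tidy = sales.melt(id_vars="region", var_name="quarter", value_name="revenue")
    print(tidy.groupby(["region", "quarter"])["revenue"].sum())
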
by Boris Paskhaver

Take the next steps in your data science career! This friendly and hands-on guide shows you how to start mastering Pandas with skills you already know from spreadsheet software.

In Pandas in Action you will learn how to:

  • Import datasets, identify issues with their data structures, and optimize them for efficiency
  • Sort, filter, pivot, and draw conclusions from a dataset and its subsets
  • Identify trends from text-based and time-based data
  • Organize, group, merge, and join separate datasets
  • Use a GroupBy object to store multiple DataFrames

Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You’ll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.

by Leo S. Hsu and Regina O. Obe

In PostGIS in Action, Third Edition you will learn:

  • An introduction to spatial databases
  • Geometry, geography, raster, and topology spatial types, functions, and queries
  • Applying PostGIS to real-world problems
  • Extending PostGIS to web and desktop applications
  • Querying data from external sources using PostgreSQL Foreign Data Wrappers
  • Optimizing queries for maximum speed
  • Simplifying geometries for greater efficiency

PostGIS in Action, Third Edition teaches readers of all levels to write spatial queries for PostgreSQL. You’ll start by exploring vector-, raster-, and topology-based GIS before quickly progressing to analyzing, viewing, and mapping data. This fully updated third edition covers key changes in PostGIS 3.1 and PostgreSQL 13, including parallelization support, partitioned tables, and new JSON functions that help in creating web mapping applications.

Solve real-world Database Administration challenges with 180+ practical recipes and best practices
by Gianni Ciolli, Boriss Mejías, Jimmy Angelakos, Vibhor Kumar and Simon Riggs

PostgreSQL has seen a huge increase in its customer base in the past few years and is becoming one of the go-to solutions for anyone who has a database-specific challenge. This PostgreSQL book touches on all the fundamentals of Database Administration in a problem-solution format. It is intended to be the perfect desk reference guide.

This new edition focuses on recipes based on the new PostgreSQL 16 release. The additions include handling complex batch loading scenarios with the SQL MERGE statement, security improvements, running Postgres on Kubernetes or with TPA and Ansible, and more. This edition also focuses on certain performance gains, such as query optimization, and the acceleration of specific operations, such as sort. It will help you understand roles, ensuring high availability, concurrency, and replication. It also draws your attention to aspects like validating backups, recovery, monitoring, and scaling aspects. This book will act as a one-stop solution to all your real-world database administration challenges.

By the end of this book, you will be able to manage, monitor, and replicate your PostgreSQL 16 database for efficient administration and maintenance with the best practices from experts.

What you will learn

  • Discover how to improve batch data loading with the SQL MERGE statement
  • Use logical replication to apply large transactions in parallel
  • Improve your backup and recovery performance with server-side compression
  • Tackle basic to high-end and real-world PostgreSQL challenges with practical recipes
  • Monitor and fine-tune your database with ease
  • Learn to navigate the newly introduced features of PostgreSQL 16
  • Efficiently secure your PostgreSQL database with new and updated features

Who this book is for

This Postgres book is for database administrators, data architects, database developers, and anyone with an interest in planning and running live production databases using PostgreSQL 16. Those looking for hands-on solutions to any problem associated with PostgreSQL 16 administration will also find this book useful. Some experience with handling PostgreSQL databases will help you to make the most out of this book; however, it is a useful resource even if you are just beginning your Postgres journey.

by Nina Zumel and John Mount

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.

From Core Concepts to Applications Using Python
by Mike X Cohen

If you want to work in any computational or technical field, you need to understand linear algebra. As the study of matrices and operations acting upon them, linear algebra is the mathematical basis of nearly all algorithms and analyses implemented in computers. But the way it's presented in decades-old textbooks is much different from how professionals use linear algebra today to solve real-world modern applications.

This practical guide from Mike X Cohen teaches the core concepts of linear algebra as implemented in Python, including how they're used in data science, machine learning, deep learning, computational simulations, and biomedical data processing applications. Armed with knowledge from this book, you'll be able to understand, implement, and adapt myriad modern analysis methods and algorithms.

Ideal for practitioners and students using computer technology and algorithms, this book introduces you to:

  • The interpretations and applications of vectors and matrices
  • Matrix arithmetic (various multiplications and transformations)
  • Independence, rank, and inverses
  • Important decompositions used in applied linear algebra (including LU and QR)
  • Eigendecomposition and singular value decomposition
  • Applications including least-squares model fitting and principal components analysis
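Two of the applications listed above, least-squares fitting and the singular value decomposition, sketched with NumPy on synthetic data (the coefficients and sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # Least-squares model fit: minimize ||Xb - y||
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("fitted coefficients:", coef)

    # Singular value decomposition, the workhorse behind PCA and pseudoinverses
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    print("singular values:", S)
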
A Comprehensive Guide to Building Real-World NLP Systems
by Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta and Harshit Surana

Many books and courses tackle natural language processing (NLP) problems with toy use cases and well-defined datasets. But if you want to build, iterate, and scale NLP systems in a business setting and tailor them for particular industry verticals, this is your guide. Software engineers and data scientists will learn how to navigate the maze of options available at each step of the journey.

Through the course of the book, authors Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana will guide you through the process of building real-world NLP solutions embedded in larger product setups. You’ll learn how to adapt your solutions for different industry verticals such as healthcare, social media, and retail.

With this book, you’ll:

  • Understand the wide spectrum of problem statements, tasks, and solution approaches within NLP
  • Implement and evaluate different NLP applications using machine learning and deep learning methods
  • Fine-tune your NLP solution based on your business problem and industry vertical
  • Evaluate various algorithms and approaches for NLP product tasks, datasets, and stages
  • Produce software solutions following best practices around release, deployment, and DevOps for NLP systems
  • Understand best practices, opportunities, and the roadmap for NLP from a business and product leader’s perspective
by Avi Pfeffer

Practical Probabilistic Programming introduces the working programmer to probabilistic programming. In it, you'll learn how to use the PP paradigm to model application domains and then express those probabilistic models in code. Although PP can seem abstract, in this book you'll immediately work on practical examples, like using the Figaro language to build a spam filter and applying Bayesian and Markov networks to diagnose computer system data problems and recover digital images.

by Kim Falk

Online recommender systems help users find movies, jobs, restaurants—even romance! There’s an art in combining statistics, demographics, and query terms to achieve results that will delight them. Learn to build a recommender system the right way: it can make or break your application!

50+ Essential Concepts Using R and Python
by Peter Bruce, Andrew Bruce and Peter Gedeck

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

  • Why exploratory data analysis is a key preliminary step in data science
  • How random sampling can reduce bias and yield a higher-quality dataset, even with big data
  • How the principles of experimental design yield definitive answers to questions
  • How to use regression to estimate outcomes and detect anomalies
  • Key classification techniques for predicting which categories a record belongs to
  • Statistical machine learning methods that "learn" from data
  • Unsupervised learning methods for extracting meaning from unlabeled data
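As a taste of the resampling ideas above, a small NumPy bootstrap sketch; the data are synthetic and the statistic (the median) is chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(7)
    sample = rng.exponential(scale=3.0, size=200)  # hypothetical skewed data

    # Bootstrap: resample with replacement to estimate uncertainty in the median
    medians = [np.median(rng.choice(sample, size=sample.size, replace=True))
               for _ in range(2000)]
    low, high = np.percentile(medians, [2.5, 97.5])
    print(f"median = {np.median(sample):.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
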
Prediction with Statistics & Machine Learning
by Aileen Nielsen

Time series data analysis is increasingly important due to the massive production of such data through the internet of things, the digitalization of healthcare, and the rise of smart cities. As continuous monitoring and data collection become more common, the need for competent time series analysis with both statistical and machine learning techniques will increase.

Covering innovations in time series data analysis and use cases from the real world, this practical guide will help you solve the most common data engineering and analysis challenges in time series, using both traditional statistical and modern machine learning techniques. Author Aileen Nielsen offers an accessible, well-rounded introduction to time series in both R and Python that will have data scientists, software engineers, and researchers up and running quickly.

You’ll get the guidance you need to confidently:

  • Find and wrangle time series data
  • Undertake exploratory time series data analysis
  • Store temporal data
  • Simulate time series data
  • Generate and select features for a time series
  • Measure error
  • Forecast and classify time series with machine or deep learning
  • Evaluate accuracy and performance
by Michael Baron

Probability and Statistics for Computer Scientists, Third Edition helps students understand fundamental concepts of Probability and Statistics, general methods of stochastic modeling, simulation, queuing, and statistical data analysis; make optimal decisions under uncertainty; model and evaluate computer systems; and prepare for advanced probability-based courses. Written in a lively style with simple language and now including R as well as MATLAB, this classroom-tested book can be used for one- or two-semester courses.

Features:

  • Axiomatic introduction of probability
  • Expanded coverage of statistical inference and data analysis, including estimation and testing, Bayesian approach, multivariate regression, chi-square tests for independence and goodness of fit, nonparametric statistics, and bootstrap
  • Numerous motivating examples and exercises including computer projects
  • Fully annotated R codes in parallel to MATLAB
  • Applications in computer science, software engineering, telecommunications, and related areas
  • In-depth yet accessible treatment of computer science-related topics

Starting with the fundamentals of probability, the text takes students through topics heavily featured in modern computer science, computer engineering, software engineering, and associated fields, such as computer simulations, Monte Carlo methods, stochastic processes, Markov chains, queuing theory, statistical inference, and regression. It also meets the requirements of the Accreditation Board for Engineering and Technology (ABET).
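
As a taste of the Monte Carlo methods the text covers, here is a minimal sketch; the book's code is in R and MATLAB, so this Python version is only an illustrative stand-in, not an example from the book:

```python
import random

# Monte Carlo estimate of pi: draw random points in the unit square and
# count how many fall inside the quarter circle of radius 1.
def estimate_pi(n_samples: int = 1_000_000) -> float:
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n_samples

print(estimate_pi())  # approaches 3.14159... as the sample size grows
```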

Start Writing Code to Wrangle, Analyze, and Visualize Data with R
by Joel Ross and Michael Freeman

Using data science techniques, you can transform raw data into actionable insights for domains ranging from urban planning to precision medicine. Programming Skills for Data Science brings together all the foundational skills you need to get started, even if you have no programming or data science experience.

Leading instructors Michael Freeman and Joel Ross guide you through installing and configuring the tools you need to solve professional-level data science problems, including the widely used R language and Git version-control system. They explain how to wrangle your data into a form where it can be easily used, analyzed, and visualized so others can see the patterns you've uncovered. Step by step, you'll master powerful R programming techniques and troubleshooting skills for probing data in new ways, and at larger scales.

Freeman and Ross teach through practical examples and exercises that can be combined into complete data science projects. Everything's focused on real-world application, so you can quickly start analyzing your own data and getting answers you can act upon. Learn to

  • Install your complete data science environment, including R and RStudio
  • Manage projects efficiently, from version tracking to documentation
  • Host, manage, and collaborate on data science projects with GitHub
  • Master R language fundamentals: syntax, programming concepts, and data structures
  • Load, format, explore, and restructure data for successful analysis
  • Interact with databases and web APIs
  • Master key principles for visualizing data accurately and intuitively
  • Produce engaging, interactive visualizations with ggplot and other R packages
  • Transform analyses into sharable documents and sites with R Markdown
  • Create interactive web data science applications with Shiny
  • Collaborate smoothly as part of a data science team

Essential Tools for Working with Data
by Jake VanderPlas

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all—IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you'll learn how:

  • IPython and Jupyter provide computational environments for scientists using Python
  • NumPy includes the ndarray for efficient storage and manipulation of dense data arrays
  • Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data
  • Matplotlib includes capabilities for a flexible range of data visualizations
  • Scikit-learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms
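
To illustrate the Scikit-Learn point above, here is a minimal fit/predict sketch; it is not drawn from the book and simply assumes scikit-learn is installed:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The estimator API is uniform across algorithms: construct, fit, then predict or score.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out split
```
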
Data Wrangling with pandas, NumPy & Jupyter
by Wes McKinney

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process.

Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

  • Use the Jupyter notebook and IPython shell for exploratory computing
  • Learn basic and advanced features in NumPy
  • Get started with data analysis tools in the pandas library
  • Use flexible tools to load, clean, transform, merge, and reshape data
  • Create informative visualizations with matplotlib
  • Apply the pandas groupby facility to slice, dice, and summarize datasets (sketched after this list)
  • Analyze and manipulate regular and irregular time series data
  • Learn how to solve real-world data analysis problems with thorough, detailed examples
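
As a taste of the groupby facility mentioned above, here is a minimal pandas sketch; the toy sales data is made up and the example is not taken from the book:

```python
import pandas as pd

# Toy sales data (made up for illustration).
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units":  [10, 15, 7, 3, 12],
    "price":  [2.5, 2.5, 3.0, 3.0, 2.75],
})

# Split-apply-combine: group rows by region, then summarize each group.
summary = df.groupby("region").agg(
    total_units=("units", "sum"),
    avg_price=("price", "mean"),
)
print(summary)
```
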
A Hands-On Introduction
by Yuli Vasiliev

You will discover Python’s rich set of built-in data structures for basic operations, as well as its robust ecosystem of open-source libraries for data science, including NumPy, pandas, scikit-learn, matplotlib, and more. Examples show how to load data in various formats, how to streamline, group, and aggregate data sets, and how to create charts, maps, and other visualizations. Later chapters go in-depth with demonstrations of real-world data applications, including using location data to power a taxi service, market basket analysis to identify items commonly purchased together, and machine learning to predict stock prices.

An Introduction to Using Anaconda, JupyterLab, and Python's Scientific Libraries
by Lee Vaughan

Python Tools for Scientists will introduce you to Python tools you can use in your scientific research, including Anaconda, Spyder, Jupyter Notebooks, JupyterLab, and numerous Python libraries. You’ll learn to use Python for tasks such as creating visualizations, representing geospatial information, simulating natural events, and manipulating numerical data.

Once you’ve built an optimal programming environment with Anaconda, you’ll learn how to organize your projects and use interpreters, text editors, notebooks, and development environments to work with your code. Following the book’s fast-paced Python primer, you’ll tour a range of scientific tools and libraries like scikit-learn and seaborn that you can use to manipulate and visualize your data, or analyze it with machine learning algorithms.

You’ll also learn how to:

  • Create isolated projects in virtual environments, build interactive notebooks, test code in the Qt console, and use Spyder’s interactive development features
  • Use Python’s built-in data types, write custom functions and classes, and document your code
  • Represent data with the essential NumPy, Matplotlib, and pandas libraries
  • Use Python plotting libraries like Plotly, HoloViews, and Datashader to handle large datasets and create 3D visualizations

Regardless of your scientific field, Python Tools for Scientists will show you how to choose the best tools to meet your research and computational analysis needs.

Import, Tidy, Transform, Visualize and Model Data
by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund

Use R to turn data into insight, knowledge, and understanding. With this practical book, aspiring data scientists will learn how to do data science with R and RStudio, along with the tidyverse—a collection of R packages designed to work together to make data science fast, fluent, and fun. Even if you have no programming experience, this updated edition will have you doing data science quickly.

You'll learn how to import, transform, and visualize your data and communicate the results. And you'll get a complete, big-picture understanding of the data science cycle and the basic tools you need to manage the details. Updated for the latest tidyverse features and best practices, new chapters show you how to get data from spreadsheets, databases, and websites. Exercises help you practice what you've learned along the way.

You'll understand how to:

  • Visualize: Create plots for data exploration and communication of results
  • Transform: Discover variable types and the tools to work with them
  • Import: Get data into R and in a form convenient for analysis
  • Program: Learn R tools for solving data problems with greater clarity and ease
  • Communicate: Integrate prose, code, and results with Quarto
Practical applications with deep learning
by Masato Hagiwara

In Real-world Natural Language Processing you will learn how to:

  • Design, develop, and deploy useful NLP applications
  • Create named entity taggers
  • Build machine translation systems
  • Construct language generation systems and chatbots
  • Use advanced NLP concepts such as attention and transfer learning

Real-world Natural Language Processing teaches you how to create practical NLP applications without getting bogged down in complex language theory and the mathematics of deep learning. In this engaging book, you’ll explore the core tools and techniques required to build a huge range of powerful NLP apps, including chatbots, language detectors, and text classifiers.

by Jan L. Harrington

Relational Database Design and Implementation: Clearly Explained, Fourth Edition, provides the conceptual and practical information necessary to develop a database design and management scheme that ensures data accuracy and user satisfaction while optimizing performance.

Database systems underlie the large majority of business information systems. Most of those in use today are based on the relational data model, a way of representing data and data relationships using only two-dimensional tables. This book covers relational database theory as well as providing a solid introduction to SQL, the international standard for the relational database data manipulation language.

The book begins by reviewing basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL. Topics such as the relational data model, normalization, data entities, and Codd's Rules (and why they are important) are covered clearly and concisely. In addition, the book looks at the impact of big data on relational databases and the option of using NoSQL databases for that purpose.

  • Features updated and expanded coverage of SQL and new material on big data, cloud computing, and object-relational databases
  • Presents design approaches that ensure data accuracy and consistency and help boost performance
  • Includes three case studies, each illustrating a different database design challenge
  • Reviews the basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL