"Building Knowledge Graphs" Barrasa Jesus and Jim Webber 2023

(Barrasa Jesus and Jim Webber 2023)

Incredibly useful, knowledge graphs help organizations keep track of medical research, cybersecurity threat intelligence, GDPR compliance, web user engagement, and much more. They do so by storing interlinked descriptions of entities—objects, events, situations, or abstract concepts—and encoding the underlying information. How do you create a knowledge graph? And how do you move it from theory into production?Using hands-on examples, this practical book shows data scientists and data engineers how to build their own knowledge graphs. Authors Jesús Barrasa and Jim Webber from Neo4j illustrate common patterns for building knowledge graphs that solve many of today’s pressing knowledge management problems. You’ll quickly discover how these graphs become increasingly useful as you add data and augment them with algorithms and machine learning.Learn the organizing principles necessary to build a knowledge graphExplore how graph databases serve as a foundation for knowledge graphsUnderstand how to import structured and unstructured data into your graphFollow examples to build integration-and-search knowledge graphsLearn what pattern detection knowledge graphs help you accomplishExplore dependency knowledge graphs through examplesUse examples of natural language knowledge graphs and chatbotsUse graph algorithms and ML to gain insight into connected data

놀랍도록 유용한 지식 그래프는 조직이 의학 연구, 사이버 보안 위협 인텔리전스, GDPR 준수, 웹 사용자 참여 등을 추적하는 데 도움이 됩니다. 지식 그래프는 개체, 이벤트, 상황 또는 추상적인 개념과 같은 개체에 대한 상호 연결된 설명을 저장하고 기본 정보를 인코딩함으로써 이를 수행합니다. 지식 그래프는 어떻게 만들까요? 그리고 어떻게 이론에서 실제 작업으로 옮길 수 있을까요? 이 실용적인 책은 데이터 과학자와 데이터 엔지니어가 자신만의 지식 그래프를 구축하는 방법을 실습 예제를 통해 보여줍니다. Neo4j의 저자 Jesús Barrasa와 Jim Webber는 오늘날의 많은 시급한 지식 관리 문제를 해결하는 지식 그래프를 구축하는 일반적인 패턴을 설명합니다. 데이터를 추가하고 알고리즘과 머신 러닝으로 그래프를 보강하면서 이러한 그래프가 어떻게 점점 더 유용해지는지 금방 알게 될 것입니다.지식 그래프 구축에 필요한 구성 원칙 알아보기그래프 데이터베이스가 지식 그래프의 기반이 되는 방법 알아보기정형 및 비정형 데이터를 그래프로 가져오는 방법 이해하기예제를 따라 통합 및 검색 지식 그래프 구축하기패턴 감지 지식 그래프가 어떤 도움을 주는지 알아보기예제를 통해 의존성 지식 그래프 알아보기자연어 지식 그래프 및 챗봇의 예 사용그래프 알고리즘과 ML을 사용하여 연결된 데이터에 대한 인사이트 확보하기

Chapter 1. Introducing Knowledge Graphs

We're overwhelmed by data. It's everywhere and being collected at a fantastic rate and stored at substantial cost. But we're not necessarily getting value from that data, though there is significant value in it, if only we could understand it.

All is not lost. Over the last decade, a new category of technology based on graphs has moved from obscurity to prominence. Graphs have come to underpin everything from consumer-facing systems like navigation and social networks to critical infrastructure like supply chains and power grids.

These important graph use cases have reached a common conclusion: applying knowledge in context is the most powerful tool that most businesses have. A set of patterns and practices called knowledge graphs has been emerging to help understand data in context, where the context is represented as a graph of connected data items. With knowledge graphs, there is hope that you can distill business value from data. This book is an attempt to show how it can be done.

Knowledge graphs are useful because they provide contextualized understanding of data. Context derives from the layer of metadata (graph topology and other features) that provides rules for structure and interpretation. This book shows how the connected context provided by knowledge graphs enables you to extract greater value from existing data, drive automation and process optimization, improve predictions, and support an agile response to changing business environments.

We wrote this book for technology professionals who are interested in building and operating knowledge graphs within their businesses. In a way, it's a sequel to or expansion of O'Reilly's Knowledge Graphs: Data in Context for Responsive Businesses, which we had the pleasure of writing in 2021. That report was intended for chief information officers (CIOs) and chief data officers (CDOs), helping them to understand the benefits of knowledge graphs. This time we're aiming at data and software professionals who build sophisticated information systems.

The book is arranged in two parts. The first part deals with graph fundamentals, including graph databases, query languages, data wrangling, and graph data science. It teaches technical practitioners the fundamental tooling needed to understand the second part of this book, which tackles significant knowledge graph use cases and how to implement them, complete with code examples and system architectures.

For data and software professionals, this book provides an on-ramp to the world of knowledge graphs and a language for discussing their implementation with your peers and management. It also gives deep examples for how to build and use knowledge graphs---​from graph basics all the way to graph machine learning.

For CIOs or CDOs, this book may still be useful since it provides a good overview of knowledge graphs and how they are delivered. You can skim the earlier chapters and code examples if that's not your thing, but you'll still be able to understand what your practitioner colleagues are doing and why.

This chapter explains the background and motivation for knowledge graphs. Here we'll introduce the notions of graphs and graph data and start to show how to build smarter systems with knowledge graphs.

What Are Graphs?

Knowledge graphs are a type of graph, so it's important to have a basic understanding of graphs. Graphs are simple structures that use nodes (or vertices) connected by relationships (or edges) to create high-fidelity models of a domain. To avoid any confusion, the graphs in this book have nothing to do with visualizing data as histograms or plotting a function, which are called charts, as shown in .

  • Figure 1-1. Graphs versus charts

    The graphs in this book are sometimes referred to as networks. They are a simple but powerful way of showing how things connect.

    Graphs are not new. In fact, graph theory was invented by Swiss mathematician Leonhard Euler in the 18th century to help compute the minimum distance that the emperor of Prussia had to walk to see the town of Königsberg (modern-day Kaliningrad) by ensuring that each of its seven bridges was crossed only once, as shown in .

  • Figure 1-2. A graphical representation of Königsberg and its seven bridges across the Pregel River

    Euler's insight was that the problem shown in could be reduced to a logical form, stripping out all the noise of the real world and concentrating solely on how things are connected. He was able to demonstrate that the problem didn't need to involve bridges, islands, or emperors. He proved that in fact the physical geography of Königsberg was completely irrelevant.

    #+begin_note

  • Note

    To us, Euler's approach is similar to modern software development. The inherent noise of the real world is stripped away so that a more valuable logical representation---​the software---​remains. For software professionals, this feels comfortingly familiar. #+end_note

    You can use the superimposed graph in to figure out the shortest route for walking around Königsberg without having to put on your walking boots and try it for real. In fact, Euler proved that the emperor could not walk the whole town crossing each bridge only once, since there would have needed to be (at least) one island (node) with an even number of connecting bridges (relationships) from which the emperor could start his walk. No such island existed in Königsberg, so no such route (path) is possible.

    Building on Euler's work, mathematicians have studied various graph models, all variations on the theme of nodes connected by relationships. Some models allow relationships to be directed, where they have an explicit start and end node, while some have undirected relationships connecting nodes. Some models, like hypergraphs, allow relationships to connect multiple nodes.

    In theory, there's no single best graph model to choose (though you can usually transform from one model to another). But there are better or worse models in practice, especially for building computer systems. In this book we've chosen the labeled property graph model as the foundation. It's a popular model that is simple for software and data professionals to understand. It is expressive enough to represent even the most complicated domains and is information rich (unlike the graphs beloved of mathematicians).

    <aside data-type="sidebar" epub:type="sidebar">

The Property Graph Model

The property graph model is the most popular model for modern graph databases. Correspondingly, it's a common basis for creating knowledge graphs. It consists of the following elements:

Nodes representing entities in the domain : - Nodes can contain zero or more properties, which are key-value pairs representing entity data such as price or date of birth. - Nodes can have zero or more labels, which declare the node's purpose in the graph, such as representing a Customer or a Product.

Relationships representing how entities interrelate : - Relationships have a type, such as BOUGHT, FOLLOWS, or LIKES. - Relationships have a direction, going from one node to another (or back to the same node).

-   Relationships can contain zero or more properties, which are key-value pairs representing some characteristic of the link, such as a timestamp or distance.

-   Relationships never dangle---​there are always a start node and an end node (which can be the same node).

You can use these primitives (nodes, labels, relationships, and properties) along with rules to assemble sophisticated, high-fidelity graph data models with relative ease.

</aside>

shows a small social graph, but compared to the example in , this graph holds much more information.

  • Figure 1-3. A graph representing people, their friendships, and their locations

    In each node has a label that represents its role in the graph. Some nodes are labeled Person and some are labeled Place, representing people and places respectively. Stored inside those nodes are properties. For example, one node has name:'Rosa' and gender:'f' that you can interpret as being a female person called Rosa. Note that the Karl and Fred nodes have slightly different properties on them, which is a perfectly fine tool as it accommodates the messiness of the real world.

  • Tip

    The property graph model does not enforce any schemas based on labels or relationship types. It's deliberately intended to be flexible and accommodating to help develop high-fidelity models. If you really need to ensure that nodes with certain labels have certain properties, you can apply constraints to the label to ensure those properties exist, are unique, and so on. You can view this not as schema or schemaless, but schema-ish. The idea is to use schema-like constraints just where you need them rather than eagerly locking down the whole data model. Those schema-like constraints, like everything else in a graph data model, can change over time. Real-world data is often uneven and incomplete, so your knowledge graphs should reflect this reality.

    Between the nodes in you see relationships. The relationships are richer than in since they have a type, a direction, and optional properties. Relationships cannot “dangle”; they always have a start node and an end node, even if the start and end nodes are the same.

    For instance, there is a Person node with the property name:'Rosa' that has an outgoing LIVES_IN relationship with the property since: 2020 to the Place node with the property city:'Berlin'. You can read this as “Rosa has lived in Berlin since 2020,” based on the direction of the relationships. You can also see that Fred is a FRIEND of Karl and that Karl is a FRIEND of Fred. Rosa and Karl are also friends, but Rosa and Fred are not.

    In the property graph model, there are no limits on the number of nodes or the number or type of relationships that interconnect them. Some nodes are densely connected while others are sparsely connected. All that matters is that the model matches the problem domain. Similarly, some nodes have lots of properties, while some have few or none. Some relationships have lots of properties, but many have none at all. This is all perfectly normal for knowledge graphs.

    It's easy to see how the graph in can answer questions about friendships and who lives where. Extending the model to include other data like hobbies, publications, or jobs is also straightforward: just keep adding nodes and relationships to match your problem domain. Creating large, complex graphs with millions or billions of connections isn't a problem for modern graph databases and graph-processing software, so building even very large knowledge graphs is achievable.

    Graph data models can comfortably represent complex networks of relationships in a way that is both human readable and machine friendly. Graphs might seem very technical at first, but they are created from very simple primitives, making them very accessible in practice. In fact the combination of a simple data model and the ease of algorithmic processing to discover connections, patterns, and features is what has made graphs so popular. It's a powerful combination you will also exploit in your knowledge graphs.

The Motivation for Knowledge Graphs

Interest in knowledge graphs has exploded, with a myriad of research papers, solutions, analyst reports, groups, and conferences on this topic. Knowledge graphs have become so popular partly because graph technology has accelerated in recent years but also because there is strong demand to make sense of data.

External factors have undoubtedly accelerated knowledge graphs to greater prominence. Stresses from the COVID-19 pandemic and fallout from geopolitics have strained some organizations to the point of breaking. Decision making has never needed to be more rapid. At the same time, businesses are hampered by the lack of timely and accurate insight.

Now businesses are reconfiguring their operations and processes to flex rapidly. As historical knowledge ages faster and is invalidated by market dynamics, many organizations need new ways of capturing, analyzing, and learning from data. Businesses need rapid insights and recommendations, from customer experience and patient outcomes to product innovation, fraud detection, and automation. They need contextualized data to generate knowledge.

Knowledge Graphs: A Definition

Now you know a little about graphs and the motivation for using knowledge graphs. But clearly not all graphs are knowledge graphs (despite the hype). Knowledge graphs are a specific type of graph with an emphasis on contextual understanding. Knowledge graphs are interlinked sets of facts that describe real-world entities, events, or things and their interrelations in a human- and machine-understandable format.

Critically, knowledge graphs must have an organizing principle so that a user (or a computer system) can reason about the underlying data. The organizing principle provides an additional layer of structure that adds context to support knowledge discovery. The organizing principle makes the data itself smarter. This idea runs contrary to the norm where intelligence resides in applications and data is dumb, just something to be mined and refined. Having smarter data both simplifies systems and encourages broad reuse.

<aside data-type="sidebar" epub:type="sidebar">

Where's the Data?

A knowledge graph can be a self-contained unit which exists in a graph database, or it can involve several coordinated graph stores forming a federation of graphs. Or a knowledge graph can be built on top of a data lake to bring structure and knowledge to undifferentiated bulk storage. A knowledge graph could also be a logical layer providing structure and insight over multiple data sources of different kinds so that data consumers get a holistic, curated view of the data.

In principle, knowledge graphs are agnostic about the physical storage of the underlying data. They can support different architectural approaches, from virtualized ones where the knowledge graph is a smart index over externally stored data to the fully materialized ones where data is fully hosted in a graph platform, and any hybrid approach between the two.

</aside>

Summary

Organizing principles, reasoning, and knowledge discovery might seem complicated at first. But in reality, you can think of knowledge graphs as a rich index over data that provides curation, much like a skilled librarian recommending books and journals to a researcher.

From here on things get a little more technical. In you'll learn how we can extend the definition of an organizing principle to act as a contract between a knowledge graph and its consuming users and systems. You'll also learn about several different options for creating organizing principles.

OceanofPDF.com

Related-Notes

References

Barrasa Jesus, and Jim Webber. 2023. Building Knowledge Graphs. 1st edition. O’Reilly Media.