Graphs as in graph theory

Graph Problems

Graph level tasks:

  • Describe properties of a graph
  • Predict how graph will behave
  • Find clusters/structure within graph
  • Describe change over time

Node level tasks:

  • Classify nodes based on community

Edge level tasks:

  • Predict when an edge will form
  • Identify anomalous edges

Vocabulary

  • In-degree/out-degree: Incoming and outgoing edges, respectively
  • Temporal graphs:
    • Edges can be added over time
    • Nodes and edges can be added over time
    • Nodes and edges can be added and removed over time

Characterizing Graphs

Danger

Don’t memorize all of the following stuff! Memorize one or two maybe, but not all

Graph Level Features

  • Diameter: Longest shortest path
  • Persistence: The smallest number of links whose removal increases the diameter or disconnects the graph
  • Average in degree/out degree
  • Nodes/edges added or deleted per time T
  • Distribution of degree over the network

Node Level Features

  • In degree/out degree
  • Degree centrality: Literally just the degree
    • Normalized degree centrality: That but normalized
  • Closeness centrality: Importance measured by how close a vertex is to other vertices (takes into account average distance to all other nodes)
  • Betweenness centrality: Number of shortest paths that pass through that vertex
  • Eigenvector centrality: A vertex’s importance is determined by importance of its neighborsco

We would like to predict when a link forms in a graph:

  • Recommend friends on Facebook
  • Figure out who will be new terrorists
  • Predict where money/resources will flow

On the flip side, if we can predict when a link will form, we can measure how unexpected a link is:

  • Fraudulent transactions in financial networks
  • Suspicious network traffic on a computer network

How can we turn this into a typical ML problem?

One way is to turn it into a supervised learning style problem.

  • For each pair of nodes, generate some features, and use nodes with links as positive examples and nodes without as negative
  • Examples of features:
    • How many neighbors the nodes have in common
    • Jaccard Coefficient
    • Number of length 3 paths between them

Supervised Learning

We want to predict whether or not a link will form between two nodes at time

Tip for final project

class_weight is an important hyperparameter for decision trees in sklearn because of class imbalance

Challenges with evaluating link prediction models:

  • Real world networks are very large and sparse
  • Links have direction (different from typical classification problems)
  • Networks change over time

Community Detection

Network topology

  • An edge is a bridge if removing it makes the graph disconnected
  • Bridges are weak ties connecting two different communities
  • Bridges are rare in real life, but can relax the definition: an edge is a “shortcut” bridge if removing it causes the distance between two terminal vertices to increase
    • The larger the distance, the weaker the tie is

Girvan-Newman Method

Results in a hierarchical partitioning of the graph

  • Remove the edges of highest betweenness first
  • Repeat the same step with the remaining graph
  • Continue this until the graph breaks down into individual nodes

As the graph breaks down into pieces, the tightly knit community structure is exposed