Case Study: Network analytics on corporate e-mail

Executive Summary

Modern organisations run on e-mail. But who actually influences information flow? Which people sit at the centre of cross-team communication, and where do we see potential bottlenecks or single points of failure?

In this case study we use a large anonymised corporate e-mail network (the EuAll SNAP dataset) to answer a simple but powerful question:

“Who e-mails whom – and what does that reveal about hidden influencers in the organisation?”

Business Problem

In large organisations, a small number of mailboxes typically carry a disproportionate share of communication: senior managers, project coordinators, operational hubs.

The EuAll email network allows us to quantify this pattern and to identify:

  • which mailboxes act as communication hubs or bottlenecks,
  • how unevenly email traffic is distributed across the organisation,
  • and what this implies for governance, risk, and change-management.

Data & Methods

The dataset is stored in Matrix Market (.mtx) format as a large, sparse adjacency matrix:

  • 265,214 nodes – anonymised e-mail accounts inside a European institution.
  • ~419,000 directed edges – each edge represents at least one e-mail from user i to user j.
  • There are no timestamps or message content; we only use the who-e-mailed-whom pattern.

From this we build a directed graph where:

  • In-degree = how many distinct people send e-mails to a given node.
  • Out-degree = how many distinct people a node sends e-mails to.
  • Total degree = in-degree + out-degree, used as a simple overall influence score.

We ignore self-loops and treat multiple e-mails between the same pair as one edge, which keeps the focus on relationship breadth rather than raw volume.
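
The construction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the original pipeline: it assumes the file uses the Matrix Market coordinate layout and that the row index is the sender and the column index the receiver (the orientation is an assumption). The names read_mtx_edges and degree_tables are hypothetical; in practice scipy.io.mmread would usually handle the parsing.

```python
def read_mtx_edges(lines):
    """Yield 0-based (sender, receiver) pairs from Matrix Market coordinate text."""
    size_line_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # banner and comment lines
        if not size_line_seen:
            size_line_seen = True  # first data line is "rows cols nonzeros"
            continue
        parts = line.split()
        # Indices in the file are 1-based; any third column (a value) is ignored.
        yield int(parts[0]) - 1, int(parts[1]) - 1


def degree_tables(edges):
    """Return (in_degree, out_degree, total_degree) counted over distinct partners."""
    # Drop self-loops and collapse repeated pairs into a single edge.
    unique = {(i, j) for i, j in edges if i != j}
    in_deg, out_deg = {}, {}
    for i, j in unique:
        out_deg[i] = out_deg.get(i, 0) + 1
        in_deg[j] = in_deg.get(j, 0) + 1
    total = {
        n: in_deg.get(n, 0) + out_deg.get(n, 0)
        for n in set(in_deg) | set(out_deg)
    }
    return in_deg, out_deg, total
```

Passing an open file handle, for example degree_tables(read_mtx_edges(open(path))), works because the reader only iterates over lines. Nodes without any edge do not appear in the returned tables.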

Analytical views

The pipeline implements several complementary views:

  1. Global degree distribution (log–log histogram).
    This shows how many people have 1, 2, 5, … incoming/outgoing connections and reveals whether the network is “flat” or dominated by a few hubs.
  2. Top in-degree and top out-degree rankings.
    • Top in-degree nodes are information magnets – many people write to them (e.g. shared mailboxes, senior leaders, helpdesks).
    • Top out-degree nodes are broadcast hubs – they contact many others (e.g. project managers, internal communications).
  3. In- vs. out-degree scatter plot (log–log).
    This separates roles:
    • Nodes high on both axes act as brokers and cross-team connectors.
    • High in-degree but low out-degree often signals escalation points or mailboxes where many issues land but few are initiated.
    • High out-degree but low in-degree suggests broadcasters whose messages fan out but who receive relatively little direct traffic.
  4. Circular “chord-style” diagram for the top influencers.
    We select the top 12 nodes by total degree and draw them on a circle. Directed links between them show who e-mails whom inside this elite core. Visually this mirrors a chord diagram: thick bundles of edges indicate strong mutual communication between key influencers; isolated spokes flag individuals who are highly connected to the wider organisation but less connected among themselves.

Together these views move from global structure to a focused lens on the informal leadership circle.
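
The selection step behind the chord-style view (view 4) can be sketched as follows. This is an illustrative version under stated assumptions: edges are given as (sender, receiver) pairs, ties in total degree are broken by node id, and the function name top_k_core is hypothetical. Drawing the circle itself is left to whichever plotting library is in use.

```python
def top_k_core(edges, k=12):
    """Return the k highest-degree nodes and the directed edges among them."""
    # Same cleaning as the rest of the analysis: no self-loops, no duplicates.
    unique = {(i, j) for i, j in edges if i != j}

    total = {}
    for i, j in unique:
        total[i] = total.get(i, 0) + 1  # counts towards the sender's out-degree
        total[j] = total.get(j, 0) + 1  # counts towards the receiver's in-degree

    # Highest total degree first; node id breaks ties deterministically.
    ranked = sorted(total, key=lambda n: (-total[n], n))[:k]
    core = set(ranked)

    # Keep only edges whose both endpoints belong to the selected core.
    core_edges = sorted((i, j) for i, j in unique if i in core and j in core)
    return ranked, core_edges
```

The returned edge list is what a chord or circular layout needs: each pair becomes one directed link between two points on the circle.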

Want to reproduce these results?

If you’d like to explore the data and Python code behind these charts, get in touch and I’ll share the dataset and implementation details.

Results / Charts

Highly unequal communication load

  • The degree distribution on a log–log scale is heavy-tailed; most mailboxes have 1–3 connections, while a few have several thousand.
  • The Gini coefficient of total degree is about 0.70, indicating strong inequality.
  • From the Lorenz curve:
    • The top 0.1% of mailboxes (~265 nodes) account for ≈ 31% of all email interactions.
    • The top 1% (~2,650 nodes) account for ≈ 55%.
    • The top 10% (~26,500 nodes) account for ≈ 68%.
  • In other words, email relationships are heavily concentrated in a very small elite of mailboxes. Because repeated messages between the same pair count as one edge, these shares measure breadth of contacts rather than message volume.
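
For readers who want to check figures of this kind, the sketch below shows one common way to compute a Gini coefficient and a top-share statistic from a list of total degrees. It is an illustration, not the original code: the function names are hypothetical, and whether zero-degree nodes were included in the reported 0.70 is not stated here, so results may differ slightly depending on that choice.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(values, fraction):
    """Share of the grand total held by the top `fraction` of entries."""
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if total == 0:
        return 0.0
    k = max(1, round(len(xs) * fraction))  # always include at least one entry
    return sum(xs[:k]) / total
```

Calling top_share(degrees, 0.001), top_share(degrees, 0.01) and top_share(degrees, 0.10) reads three points off the Lorenz curve from its upper end.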

One giant component with many small islands

  • When we ignore direction, the network splits into 15,836 connected components.
  • The largest component contains ≈ 225,000 nodes, roughly 85% of all mailboxes – this is the “core” corporate communication network.
  • The remaining 15% are tiny clusters and isolates: detached teams, external contacts, or dormant addresses.
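
Component counts like these can be obtained with a union-find pass over the edge list, ignoring direction. The sketch below is one possible approach, not necessarily the one used for this study; it assumes nodes are numbered 0 to num_nodes - 1, and the function name weak_component_sizes is hypothetical.

```python
from collections import Counter


def weak_component_sizes(num_nodes, edges):
    """Sizes of weakly connected components, largest first."""
    parent = list(range(num_nodes))  # every node starts as its own root

    def find(x):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge; edge direction is irrelevant here

    sizes = Counter(find(node) for node in range(num_nodes))
    return sorted(sizes.values(), reverse=True)
```

The length of the returned list is the number of components, and its first element divided by num_nodes gives the share of mailboxes in the giant component.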

Influencer nodes and communication roles

  • The top-degree node (Node 179171) participates in 7,636 email relations:
    • 7,631 as receiver, only 5 as sender.
    • This is a textbook information sink: a central mailbox that many people write to, but that rarely initiates conversation (think of a shared support inbox or escalation mailbox).
  • Other high-degree nodes show complementary patterns:
    • Some have very high out-degree and modest in-degree, acting as broadcasters or announcement mailboxes.
    • Others combine high in- and out-degree and sit in the upper-right quadrant of the role map – these are the true hubs that both collect and redistribute information.
  • The ego-network around the main hub shows a star-like pattern: one central node connected to a tight ring of ~25 high-degree neighbours, with relatively few connections among those neighbours.
    • This indicates a hub-and-spoke structure, where the central mailbox coordinates work across otherwise loosely connected groups.

Who emails whom?

  • The chord-style figure focuses on the top 12 influencers by total degree.
  • Each node occupies a slice of the circle; chords show direct email ties between them.
  • Even among this elite, the subgraph is sparse: a few dense pairs exchange messages, while many connections run from or to the main hub.
  • Visually, this makes clear that the core leadership or coordination layer is not a fully connected clique but a set of specialised hubs with distinct roles.

Business Impact

Operational risk & continuity

  • When the top 1% of mailboxes account for over half of all email connections (and the top 0.1% for almost a third), the organisation is structurally dependent on a small group of individuals or shared inboxes.
  • These addresses are single points of failure: illness, turnover, or access problems can disrupt large parts of the communication flow.
  • Recommendation: treat high-degree mailboxes as critical infrastructure – set up backup owners, shared access, and clear escalation paths.

Process mapping through communication patterns

  • The roles inferred from in- vs out-degree help map informal processes:
    • High-in / low-out nodes → intake & escalation points (customer support, reporting mailboxes).
    • High-out / low-in nodes → broadcast channels (internal newsletters, HR announcements).
    • High-in / high-out nodes → coordinators / project managers.
  • Even without knowing job titles, management gains a high-level picture of how work actually flows and whether this matches the formal org chart.
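
The role mapping above can be expressed as a simple threshold rule on in- and out-degree. The sketch below is illustrative only: the cut-off of 100 distinct partners and the role labels are assumptions chosen for the example, not values taken from the analysis, and the function name classify_roles is hypothetical.

```python
def classify_roles(in_deg, out_deg, high=100):
    """Label each node by whether its in-/out-degree reaches the `high` cut-off."""
    roles = {}
    for node in set(in_deg) | set(out_deg):
        high_in = in_deg.get(node, 0) >= high
        high_out = out_deg.get(node, 0) >= high
        if high_in and high_out:
            roles[node] = "coordinator"   # high-in / high-out
        elif high_in:
            roles[node] = "intake"        # high-in / low-out
        elif high_out:
            roles[node] = "broadcaster"   # high-out / low-in
        else:
            roles[node] = "peripheral"    # below the cut-off on both axes
    return roles
```

With the figures reported for Node 179171 (7,631 incoming and 5 outgoing relations), this rule labels it an intake point. In practice the cut-off would be tuned, for example to a high percentile of the degree distribution.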

Targeted compliance and awareness programmes

  • The same high-degree mailboxes are also where phishing, data loss, and compliance breaches would have the greatest impact.
  • Instead of a one-size-fits-all security programme, the network view allows:
    • Prioritised training and monitoring for key influencers and hubs.
    • Focused anomaly detection on these nodes (sudden spikes, unusual contacts, off-hours behaviour), which requires timestamped mail logs beyond the static network used here.

Change-management and communication planning

  • For major strategic announcements or change programmes, the identified hubs are the natural amplifiers:
    • they already have trust and visibility in the network,
    • and their out-degree indicates real reach.
  • Communication and HR teams can use the list of top hubs as a “network of champions” to involve early, brief more deeply, and equip with tailored materials.