Hitlog Processing: Analyzing User Journeys to Identify Influential Content

Introduction & The Problem

Understanding which content drives user engagement and conversions is crucial for any content-driven platform. Imagine you run a news website where users read multiple articles before deciding to register. Which articles are actually influencing that registration decision? Which content pieces are your hidden conversion drivers?

This is the exact problem the Hitlog Processing project solves. Given a stream of user page views (hitlog data) containing article views and registration events, we need to determine which articles are most influential in leading users to register. The challenge lies in accurately tracking user journeys, handling multiple registration events, and avoiding double-counting articles within the same journey.

The solution? A production-ready Python application that processes hitlog CSV files and ranks articles by their influence on user registrations. Think of it as an attribution system that tells you: "This article was viewed by 47 unique users before they registered."
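To make the input concrete, here are a few illustrative hitlog rows. The user_id, page_name, and page_url columns match the code shown later; the timestamp column name and the /register URL are assumptions for this example.

timestamp,user_id,page_name,page_url
2025-10-27T08:01:12,user_42,"Economic Outlook 2025",/articles/economic-outlook-2025
2025-10-27T08:05:47,user_42,"Climate Summit Reaches Agreement",/articles/climate-summit
2025-10-27T08:09:03,user_42,Register,/register

Everything user_42 read before the /register hit counts as part of one journey.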

Project Structure & Modular Python Architecture

One of the key strengths of this project is its clean, modular architecture following Python best practices for production-ready applications. Let's examine the structure:

hitlog_processing/
├── data/
│   ├── data_gen.py              # Synthetic data generator
│   ├── logs/                    # Input CSV files
│   └── outputs/                 # Generated results
├── notebooks/
│   ├── data_exploration.ipynb   # EDA and journey analysis
│   └── solutions/               # Algorithm prototypes
├── src/
│   └── telegraph_ranker/
│       ├── cli.py               # Command-line interface
│       ├── domain.py            # Type definitions
│       ├── io_utils.py          # CSV reading/writing
│       ├── approaches/
│       │   ├── timestamp_based.py
│       │   └── graph_based.py
│       └── models/
│           └── node.py          # Graph node representation
├── tests/
│   ├── conftest.py              # Pytest fixtures
│   └── test_cli.py              # Integration tests
├── Makefile                     # Development automation
├── pyproject.toml               # Project configuration
└── requirements.txt             # Pinned dependencies

Advantages of This Modular Layout

Separation of Concerns

Each module has a single, well-defined responsibility. CLI logic is separate from algorithms, which are separate from I/O operations.

Testability

Unit tests can target specific modules without invoking the entire system. Pure functions make testing deterministic and fast.

Reusability

Core algorithms in approaches/ can be imported and reused in notebooks, web APIs, or other applications without coupling to the CLI.

Scalability

Adding new ranking approaches is as simple as creating a new file in approaches/ with a build_ranking() function.
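
For illustration, here is a minimal sketch of what a hypothetical new module, say approaches/view_count_baseline.py, could look like. The file name and its naive logic are inventions for this example; only the build_ranking() signature and output schema follow the project's contract.

import pandas as pd

def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    """Rank articles by raw view count (a deliberately naive baseline)."""
    articles = df[df["page_url"].str.startswith("/articles/")]
    out = (
        articles.groupby(["page_name", "page_url"], sort=False)
        .size()
        .reset_index(name="total")
    )
    return out.sort_values("total", ascending=False).reset_index(drop=True)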

Tools & Technologies Used

Python 3.13

Latest Python with modern type hints, dataclasses with slots, and performance improvements.

Pandas

Efficient CSV I/O, data manipulation, grouping operations, and sorting for large datasets.

NetworkX

Graph-based analysis with custom Node models for representing user journey flows.

Pytest

Comprehensive test suite with fixtures for integration testing and result validation.

Ruff

Lightning-fast linting and formatting for consistent code style and quality checks.

UV Package Manager

Modern, fast dependency resolution and virtual environment management.

Two Algorithmic Approaches

The project implements two distinct solutions to the same problem, each with different implementation strategies but producing identical results.

1. Timestamp-Based Approach

This is the default and simpler implementation. It processes events chronologically for each user, maintaining a set of unique articles seen in the current "journey."

import pandas as pd

# Module-level constants; the registration URL shown here is illustrative
ARTICLE_PREFIX = "/articles/"
REG_URL = "/register"

def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    weights: dict[str, int] = {}
    names: dict[str, str] = {}

    # Rows are assumed to be sorted chronologically within each user
    for _uid, grp in df.groupby("user_id", sort=False):
        seen_in_journey: set[str] = set()

        for _, row in grp.iterrows():
            url = row["page_url"]
            if url.startswith(ARTICLE_PREFIX):
                if url not in seen_in_journey:
                    seen_in_journey.add(url)
                    names.setdefault(url, row["page_name"])

            if url == REG_URL:
                # Commit +1 to all articles in journey
                for art_url in seen_in_journey:
                    weights[art_url] = weights.get(art_url, 0) + 1
                seen_in_journey.clear()  # Reset for next journey

    # Build and sort the output DataFrame (page_name, page_url, total)
    out = pd.DataFrame(
        {
            "page_name": [names[u] for u in weights],
            "page_url": list(weights),
            "total": list(weights.values()),
        }
    )
    return out.sort_values("total", ascending=False).reset_index(drop=True)

Key Features:

  • A single chronological pass per user keeps the runtime linear in the number of events
  • The seen_in_journey set deduplicates articles, so a re-read article is counted once per journey
  • Clearing the set after each registration resets the journey, handling users who register more than once
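
A quick sanity check on a two-user toy frame (using the illustrative /register URL from above):

import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "page_name": ["A", "B", "Register", "A", "Register"],
        "page_url": ["/articles/a", "/articles/b", "/register",
                     "/articles/a", "/register"],
    }
)
print(build_ranking(toy))
# /articles/a is credited by both users (total 2), /articles/b by one (total 1)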

2. Graph-Based Approach

This approach builds a directed graph where nodes represent pages and edges represent user navigation patterns. It's more complex but extensible for future graph analytics.

def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    # Build Node objects for all pages
    nodes = _build_nodes(df)

    # Create directed edges between consecutive pages
    _link_edges(df, nodes)

    # Apply journey-based weight accumulation
    _accumulate_weights(df, nodes)

    # Extract article nodes with positive weights into the same
    # (page_name, page_url, total) schema as the timestamp approach
    ...

Key Features:

  • Every page becomes a Node; consecutive views become directed edges capturing navigation flow
  • Weight accumulation follows the same journey semantics as the timestamp approach
  • The graph is retained after ranking, leaving room for further journey analytics
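
The Node model itself lives in models/node.py and is not reproduced in this post; a plausible shape, given that the project leans on dataclasses with slots, might be the following (field names are assumptions):

from dataclasses import dataclass, field

@dataclass(slots=True)
class Node:
    """A page in the journey graph; field names are illustrative."""

    url: str
    name: str
    weight: int = 0                                    # registrations credited
    successors: set[str] = field(default_factory=set)  # outgoing edge targets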

Note: Both approaches produce identical results for influence ranking. The test suite validates this equivalence. Choose timestamp-based for simplicity, graph-based when you need the graph structure for additional analyses.
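Outside the test suite, the same equivalence can be checked in a few lines; the module paths follow the project tree above:

import pandas as pd

from telegraph_ranker.approaches import graph_based, timestamp_based

df = pd.read_csv("data/logs/hitlog_2025-10-27.csv")
ts = timestamp_based.build_ranking(df)
gr = graph_based.build_ranking(df)
assert dict(zip(ts["page_url"], ts["total"])) == dict(zip(gr["page_url"], gr["total"]))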

Testing & Code Quality

Quality assurance is built into the development workflow through automated testing and code quality checks.

Test Coverage

The test suite uses pytest to validate both approaches against the same sample data, ensuring output consistency:

def test_cli_timestamp_and_graph(sample_csv_path, outputs_dir, run_cli_env):
    out_ts = outputs_dir / "influence_timestamp.csv"
    out_gr = outputs_dir / "influence_graph.csv"

    # Run both approaches against the same input
    _run_cli(sample_csv_path, out_ts, "timestamp", run_cli_env)
    _run_cli(sample_csv_path, out_gr, "graph", run_cli_env)

    # Validate schema and data
    df_ts = _assert_schema_and_nonempty(out_ts)
    df_gr = _assert_schema_and_nonempty(out_gr)

    # Ensure results match: same total per article from both approaches
    totals_timestamp = dict(zip(df_ts["page_url"], df_ts["total"]))
    totals_graph = dict(zip(df_gr["page_url"], df_gr["total"]))
    assert totals_timestamp == totals_graph

Automated Quality Pipeline

The Makefile provides convenient targets for maintaining code quality:

# Format code with Ruff
make format

# Run linter
make lint

# Run test suite
make test

# Complete pre-commit pipeline: format + lint + test
make commit

Usage Example

The CLI makes it easy to process hitlog data and generate influence rankings:

# Install the package
pip install -e .

# Run with timestamp approach (default)
python -m telegraph_ranker.cli \
  --input data/logs/hitlog_2025-10-27.csv \
  --output data/outputs/influence_timestamp.csv \
  --approach timestamp

# Run with graph approach
python -m telegraph_ranker.cli \
  --input data/logs/hitlog_2025-10-27.csv \
  --output data/outputs/influence_graph.csv \
  --approach graph

Sample Output

The output CSV contains ranked articles sorted by influence score:

page_name,page_url,total
"Breaking: Major Policy Announcement",/articles/major-policy-announcement,47
"Tech Giants Face New Regulations",/articles/tech-regulations,42
"Economic Outlook 2025",/articles/economic-outlook-2025,38
"Climate Summit Reaches Agreement",/articles/climate-summit,35
...

Data Generation & Testing

The project includes a synthetic data generator (data/data_gen.py) that creates realistic user journeys for testing and demonstration: timestamped page views across many users, mixing article reads with registration events so that both approaches can be exercised end to end.
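
As a rough illustration of what such a generator involves (the real data_gen.py may differ; the column layout and /register URL here mirror the earlier examples):

import csv
import random
from datetime import datetime, timedelta

ARTICLES = [f"/articles/story-{i}" for i in range(1, 21)]  # hypothetical URLs

def generate_hitlog(path: str, n_users: int = 100) -> None:
    """Write a synthetic hitlog CSV with article views and registrations."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "user_id", "page_name", "page_url"])
        for uid in range(n_users):
            ts = datetime(2025, 10, 27, 8, 0) + timedelta(minutes=random.randint(0, 600))
            # Each user reads a handful of articles in chronological order
            for url in random.sample(ARTICLES, k=random.randint(1, 5)):
                writer.writerow([ts.isoformat(), f"user_{uid}", url.rsplit("/", 1)[-1], url])
                ts += timedelta(minutes=random.randint(1, 10))
            if random.random() < 0.4:  # roughly 40% of users convert
                writer.writerow([ts.isoformat(), f"user_{uid}", "Register", "/register"])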

Key Learnings & Best Practices

Architecture Decisions
  • Separate pure functions from I/O operations
  • Use type hints for better IDE support and documentation
  • Leverage dataclasses with slots for performance
  • Keep algorithms independent of data sources
Development Workflow
  • Use Makefile for automating repetitive tasks
  • Combine linting, formatting, and tests in CI pipeline
  • Maintain both notebooks (exploration) and production code
  • Write integration tests that validate CLI behavior

Future Enhancements

Potential extensions to explore include real-time stream processing of hitlogs (previewed in the closing note below), richer analytics over the journey graph that the graph-based approach already builds, and alternative attribution models beyond the current equal credit per journey.

Conclusion

The Hitlog Processing project demonstrates how to build a production-ready Python application with clean architecture, comprehensive testing, and multiple algorithmic approaches. By separating concerns, writing testable code, and following Python best practices, we've created a maintainable solution that can easily scale and evolve.

Whether you're analyzing content influence, building attribution systems, or just learning about user journey analysis, this project provides a solid foundation and real-world example of professional Python development.

View the Full Project

Explore the complete source code, documentation, and examples on GitHub:

GitHub Repository

Next up: Scaling this to real-time stream processing with Apache Kafka! Stay tuned!