Understanding which content drives user engagement and conversions is crucial for any content-driven platform. Imagine you run a news website where users read multiple articles before deciding to register. Which articles are actually influencing that registration decision? Which content pieces are your hidden conversion drivers?
This is the exact problem the Hitlog Processing project solves. Given a stream of user page views (hitlog data) containing article views and registration events, we need to determine which articles are most influential in leading users to register. The challenge lies in accurately tracking user journeys, handling multiple registration events, and avoiding double-counting articles within the same journey.
The solution? A production-ready Python application that processes hitlog CSV files and ranks articles by their influence on user registrations. Think of it as an attribution system that tells you: "This article was viewed by 47 unique users before they registered."
One of the key strengths of this project is its clean, modular architecture following Python best practices for production-ready applications. Let's examine the structure:
hitlog_processing/
├── data/
│   ├── data_gen.py              # Synthetic data generator
│   ├── logs/                    # Input CSV files
│   └── outputs/                 # Generated results
├── notebooks/
│   ├── data_exploration.ipynb   # EDA and journey analysis
│   └── solutions/               # Algorithm prototypes
├── src/
│   └── telegraph_ranker/
│       ├── cli.py               # Command-line interface
│       ├── domain.py            # Type definitions
│       ├── io_utils.py          # CSV reading/writing
│       ├── approaches/
│       │   ├── timestamp_based.py
│       │   └── graph_based.py
│       └── models/
│           └── node.py          # Graph node representation
├── tests/
│   ├── conftest.py              # Pytest fixtures
│   └── test_cli.py              # Integration tests
├── Makefile                     # Development automation
├── pyproject.toml               # Project configuration
└── requirements.txt             # Pinned dependencies
Each module has a single, well-defined responsibility, which pays off in several ways:
- Separation of concerns: CLI logic is separate from the algorithms, which are separate from I/O operations.
- Testability: unit tests can target specific modules without invoking the entire system, and pure functions keep tests deterministic and fast.
- Reusability: the core algorithms in approaches/ can be imported into notebooks, web APIs, or other applications without coupling to the CLI.
- Extensibility: adding a new ranking approach is as simple as creating a new file in approaches/ with a build_ranking() function.
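To make that pluggable design concrete, here is a minimal, hypothetical sketch of one way a name-to-approach dispatch can work. The registry, decorator, and `run()` helper below are illustrative assumptions, not the project's actual API:

```python
from typing import Callable

import pandas as pd

# Hypothetical registry: each ranking approach registers itself by name,
# so the CLI can dispatch on a string like "timestamp" or "graph".
RANKERS: dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {}

def register(name: str):
    """Decorator that makes a build_ranking-style function selectable by name."""
    def decorator(fn):
        RANKERS[name] = fn
        return fn
    return decorator

@register("timestamp")
def timestamp_ranking(df: pd.DataFrame) -> pd.DataFrame:
    ...  # chronological single-pass implementation would go here

@register("graph")
def graph_ranking(df: pd.DataFrame) -> pd.DataFrame:
    ...  # graph-based implementation would go here

def run(approach: str, df: pd.DataFrame) -> pd.DataFrame:
    # The CLI's --approach flag would map straight onto this lookup
    return RANKERS[approach](df)
```

Adding a third approach would then mean writing one new decorated function, with no changes to the dispatch code.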
Under the hood, the project relies on:
- A recent Python release: modern type hints, dataclasses with slots, and performance improvements.
- pandas: efficient CSV I/O, data manipulation, grouping operations, and sorting for large datasets.
- A custom graph model: Node objects for representing user journey flows.
- pytest: a comprehensive test suite with fixtures for integration testing and result validation.
- Ruff: lightning-fast linting and formatting for consistent code style and quality checks.
- Modern, fast dependency resolution and virtual environment management.
The project implements two distinct approaches to the same problem; their strategies differ, but they produce identical results.
This is the default and simpler implementation. It processes events chronologically for each user, maintaining a set of unique articles seen in the current "journey."
def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    weights: dict[str, int] = {}
    names: dict[str, str] = {}
    for _uid, grp in df.groupby("user_id", sort=False):
        seen_in_journey: set[str] = set()
        for _, row in grp.iterrows():
            url = row["page_url"]
            if url.startswith(ARTICLE_PREFIX):
                if url not in seen_in_journey:
                    seen_in_journey.add(url)
                    names.setdefault(url, row["page_name"])
            if url == REG_URL:
                # Commit +1 to all articles in journey
                for art_url in seen_in_journey:
                    weights[art_url] = weights.get(art_url, 0) + 1
                seen_in_journey.clear()  # Reset for next journey
    # Build and sort output DataFrame
    ...
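To see the whole pipeline end to end, here is a self-contained sketch that fills in the elided tail. The prefix/URL constants, output sorting, and the tiny inline hitlog are illustrative assumptions, not copied from the project:

```python
import pandas as pd

# Illustrative assumptions about the constants used above
ARTICLE_PREFIX = "/articles/"
REG_URL = "/register"

def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    weights: dict[str, int] = {}
    names: dict[str, str] = {}
    for _uid, grp in df.groupby("user_id", sort=False):
        seen_in_journey: set[str] = set()
        for _, row in grp.iterrows():
            url = row["page_url"]
            if url.startswith(ARTICLE_PREFIX):
                if url not in seen_in_journey:
                    seen_in_journey.add(url)
                    names.setdefault(url, row["page_name"])
            if url == REG_URL:
                # Commit +1 to every article seen in this journey
                for art_url in seen_in_journey:
                    weights[art_url] = weights.get(art_url, 0) + 1
                seen_in_journey.clear()
    out = pd.DataFrame(
        [(names[u], u, w) for u, w in weights.items()],
        columns=["page_name", "page_url", "total"],
    )
    return out.sort_values(
        ["total", "page_name"], ascending=[False, True], ignore_index=True
    )

# Tiny inline hitlog: u1 reads two articles before registering, u2 reads one
hitlog = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "page_url":  ["/articles/a", "/articles/b", "/register", "/articles/a", "/register"],
    "page_name": ["A", "B", "Register", "A", "Register"],
})
ranking = build_ranking(hitlog)
# "/articles/a" was seen by both users before registering, so its total is 2
```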
Key features:
- A single chronological pass per user, with constant-time membership checks via a set.
- Articles are deduplicated within a journey, so repeat views are never double-counted.
- Each registration event commits +1 to every article in the current journey, then resets the journey.
This approach builds a directed graph where nodes represent pages and edges represent user navigation patterns. It's more complex but extensible for future graph analytics.
def build_ranking(df: pd.DataFrame) -> pd.DataFrame:
    # Build Node objects for all pages
    nodes = _build_nodes(df)
    # Create directed edges between consecutive pages
    _link_edges(df, nodes)
    # Apply journey-based weight accumulation
    _accumulate_weights(df, nodes)
    # Extract article nodes with positive weights
    ...
Key features:
- Pages become Node objects, and consecutive page views become directed edges.
- Journey-based weight accumulation yields the same totals as the timestamp approach.
- The explicit graph structure is extensible for future graph analytics on navigation patterns.
Quality assurance is built into the development workflow through automated testing and code quality checks.
The test suite uses pytest to validate both approaches against the same sample data, ensuring output consistency:
def test_cli_timestamp_and_graph(sample_csv_path, outputs_dir, run_cli_env):
    out_ts = outputs_dir / "influence_timestamp.csv"
    out_gr = outputs_dir / "influence_graph.csv"
    # Run both approaches
    _run_cli(sample_csv_path, out_ts, "timestamp", run_cli_env)
    _run_cli(sample_csv_path, out_gr, "graph", run_cli_env)
    # Validate schema and data
    df_ts = _assert_schema_and_nonempty(out_ts)
    df_gr = _assert_schema_and_nonempty(out_gr)
    # Ensure both approaches produce identical results
    totals_timestamp = dict(zip(df_ts["page_url"], df_ts["total"]))
    totals_graph = dict(zip(df_gr["page_url"], df_gr["total"]))
    assert totals_timestamp == totals_graph
The Makefile provides convenient targets for maintaining code quality:
# Format code with Ruff
make format
# Run linter
make lint
# Run test suite
make test
# Complete pre-commit pipeline: format + lint + test
make commit
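The Makefile itself isn't reproduced in this post, but a plausible sketch of those targets, assuming Ruff and pytest as shown above, could look like this (the project's actual recipes may differ):

```make
format:
	ruff format src tests

lint:
	ruff check src tests

test:
	pytest -q

commit: format lint test
```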
The CLI makes it easy to process hitlog data and generate influence rankings:
# Install the package
pip install -e .
# Run with timestamp approach (default)
python -m telegraph_ranker.cli \
--input data/logs/hitlog_2025-10-27.csv \
--output data/outputs/influence_timestamp.csv \
--approach timestamp
# Run with graph approach
python -m telegraph_ranker.cli \
--input data/logs/hitlog_2025-10-27.csv \
--output data/outputs/influence_graph.csv \
--approach graph
The output CSV contains ranked articles sorted by influence score:
page_name,page_url,total
"Breaking: Major Policy Announcement",/articles/major-policy-announcement,47
"Tech Giants Face New Regulations",/articles/tech-regulations,42
"Economic Outlook 2025",/articles/economic-outlook-2025,38
"Climate Summit Reaches Agreement",/articles/climate-summit,35
...
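A quick way to sanity-check an output file programmatically, assuming only the three-column schema shown above (the inline sample here just copies two rows from it):

```python
from io import StringIO

import pandas as pd

# Two rows copied from the sample output above
sample = '''page_name,page_url,total
"Breaking: Major Policy Announcement",/articles/major-policy-announcement,47
"Tech Giants Face New Regulations",/articles/tech-regulations,42
'''

df = pd.read_csv(StringIO(sample))
assert list(df.columns) == ["page_name", "page_url", "total"]
assert df["total"].is_monotonic_decreasing  # ranked by influence, descending
```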
The project includes a synthetic data generator (data/data_gen.py) that creates realistic user journeys for testing and demonstration. It generates multiple users whose timestamped page views mix article reads with registration events.
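data_gen.py itself isn't shown here, but a generator in that spirit can be sketched as follows. The article pool, probabilities, and column names are illustrative assumptions:

```python
import random
from datetime import datetime, timedelta

# Illustrative article pool; the real generator's catalogue will differ
ARTICLES = [
    ("/articles/a", "Article A"),
    ("/articles/b", "Article B"),
    ("/articles/c", "Article C"),
]
REG_URL = "/register"

def generate_hitlog(n_users: int, seed: int = 0) -> list[dict]:
    """Each user views 1-3 distinct articles, then registers with 50% probability."""
    rng = random.Random(seed)
    t = datetime(2025, 10, 27, 9, 0, 0)
    rows: list[dict] = []
    for uid in range(n_users):
        for url, name in rng.sample(ARTICLES, k=rng.randint(1, 3)):
            t += timedelta(seconds=rng.randint(5, 300))
            rows.append({"timestamp": t.isoformat(), "user_id": f"u{uid}",
                         "page_url": url, "page_name": name})
        if rng.random() < 0.5:
            t += timedelta(seconds=rng.randint(5, 300))
            rows.append({"timestamp": t.isoformat(), "user_id": f"u{uid}",
                         "page_url": REG_URL, "page_name": "Register"})
    return rows

rows = generate_hitlog(5)
```

Seeding the RNG keeps the generated journeys reproducible across runs, which matters when tests assert on the resulting rankings.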
Potential extensions to explore include deeper graph analytics on the navigation edges and moving from batch CSV files to stream processing.
The Hitlog Processing project demonstrates how to build a production-ready Python application with clean architecture, comprehensive testing, and multiple algorithmic approaches. By separating concerns, writing testable code, and following Python best practices, we've created a maintainable solution that can easily scale and evolve.
Whether you're analyzing content influence, building attribution systems, or just learning about user journey analysis, this project provides a solid foundation and real-world example of professional Python development.
Explore the complete source code, documentation, and examples on GitHub:
GitHub Repository

Next up: scaling this to real-time stream processing with Apache Kafka. Stay tuned!