(Almost) End to End Log File Analysis with Python
Often treated as a niche exercise or a buzzword dropped to sound extra technical, log file analysis brings hard data into conversations otherwise built on assumptions. Trying to understand bot crawl behavior through third-party tooling such as Google Search Console and Bing Webmaster Tools alone only gives you part of the story. If you want to run SEO experiments with internal linking, or see how crawl behavior changes during a site migration, you will need the level of granularity that log files provide. Whatever the reason, if you are looking to scale up your understanding of crawl behavior, you will want to make sure you have the tools at your disposal to do so.
* Note that this is not an article on the value of log file analysis (of which there are many floating around), but if you are looking for a place to start, I would recommend the following:
- Moz’s White Board Friday on Log File Analysis
- Jet Octopus’ Log File Analysis Checklist
- Botify’s Log File Analysis 101
Why Shouldn’t I Just Use SEMRush/Screaming Frog/JetOctopus/Botify and Excel?
Jean said it best in his comment on the White Board Friday:
Often, especially from an enterprise standpoint, trying to leverage third-party tools for log file parsing and data analysis falls into the trap of scale. Any brand that wants to do log file analysis at scale will be stopped the second they try to look at more than a day's worth of data. For one-time, ad-hoc analysis, using Screaming Frog or SEMRush for parsing and Excel for analysis works fine. If you are a brand leveraging Botify for log file analysis, you most likely have the budget to bring in both the tool and their internal team to assist. Navigating UIs, dragging and dropping files, waiting for the tool to load results, and then manually exporting them all add to the overall time sink of the process, and that's before Excel crashes when you try to open the file. If you are looking to do enterprise-scale work in a custom solution without having to spend incrementally to do so, Python is probably the best way to go.
Extracting the Log File Itself Within Python
Not the most exciting part, but I found that when requesting log files across a larger timeframe, I would end up with file types beyond .log. In past work, I have seen files come back as both .log and .gz. Accessing .log files can be done with a simple with open() statement. For a .gz file, you can use the gzip library and a with gzip.open('file.gz') statement, similar to a .log file.
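As a rough sketch, a small helper along these lines can hide the file-type difference (the function name and text-mode handling here are my own, not from a specific library):

```python
import gzip

def read_log_lines(path):
    """Return the raw lines of a log file, handling both plain .log and gzipped .gz files."""
    if path.endswith(".gz"):
        # "rt" opens the gzip file in text mode so we get decoded strings, not bytes
        with gzip.open(path, "rt", errors="ignore") as f:
            return f.readlines()
    with open(path, "r", errors="ignore") as f:
        return f.readlines()
```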
Parsing Separate Logs from a Log File
Looking at a log file, you would expect that a simple .split() would give you a clean, parseable file, but in practice log files do not always maintain the same consistency every time (even though the format stays the same, sometimes the server logs actions that omit a specific element of the line itself). Lines where certain elements are omitted can throw a .split() approach completely off. Instead, I have relied on a series of regexes to isolate the elements necessary to analyze log files from an SEO standpoint. These have all been consolidated into the function log_parse().
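Since the exact regexes depend on your server's log format, the version below is only a sketch assuming a fairly standard combined log format; the field names and patterns are illustrative, not the definitive implementation:

```python
import re

# Hypothetical sketch of a log_parse()-style function for a combined log format;
# your server's format may require different patterns.
IP_RE        = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3})')
TIMESTAMP_RE = re.compile(r'\[([^\]]+)\]')
REQUEST_RE   = re.compile(r'"([A-Z]+) ([^ "]+) HTTP/[^"]*"')
STATUS_RE    = re.compile(r'" (\d{3}) ')
UA_RE        = re.compile(r'"([^"]*)"$')

def log_parse(line):
    """Pull the SEO-relevant pieces out of a single log line.
    Any element the server omitted simply comes back as None."""
    ip        = IP_RE.search(line)
    timestamp = TIMESTAMP_RE.search(line)
    request   = REQUEST_RE.search(line)
    status    = STATUS_RE.search(line)
    ua        = UA_RE.search(line.rstrip())
    return {
        "ip":         ip.group(1) if ip else None,
        "timestamp":  timestamp.group(1) if timestamp else None,
        "method":     request.group(1) if request else None,
        "path":       request.group(2) if request else None,
        "status":     int(status.group(1)) if status else None,
        "user_agent": ua.group(1) if ua else None,
    }
```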
Running The Script and Data Storage
Once you have your log files stored in an accessible area, you can run the function against your log file in a loop (in the instance below I’m using a nested loop to parse multiple log files in one run).
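A minimal sketch of that loop, assuming a local logs/ directory plus the read_log_lines() and log_parse() helpers sketched above:

```python
import os
import pandas as pd

log_dir = "logs"  # placeholder path; point this at wherever your log files live
parsed_rows = []

for file_name in os.listdir(log_dir):
    if not file_name.endswith((".log", ".gz")):
        continue
    # Outer loop: each log file; inner loop: each raw line in that file
    for line in read_log_lines(os.path.join(log_dir, file_name)):
        parsed_rows.append(log_parse(line))

df = pd.DataFrame(parsed_rows)
# Parse the raw timestamp string into a proper datetime for later pivots
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z", errors="coerce")
```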
After parsing your file, there are multiple approaches you can take to store your data based on the scale and needs of your site. If you are planning on leveraging this approach in the short term or for a smaller site, you may want to store locally in a pickle. For building processes on a larger scale, you should rely on migrating data into a SQL server, a BigQuery instance or your preferred data warehousing solution.
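For illustration, both options might look something like this (the table and project names are placeholders, and the BigQuery path assumes the pandas-gbq dependency is installed):

```python
# Short term / smaller site: persist the DataFrame locally as a pickle
df.to_pickle("parsed_logs.pkl")

# Larger scale: append into a BigQuery table instead
df.to_gbq("seo_logs.crawl_data", project_id="your-project-id", if_exists="append")
```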
Googlebot Verification
When going through your log files, you will want to make sure that you are only pulling out valid crawl data from Googlebot/Bingbot/your bot of choice. One approach is to validate via reverse DNS lookup directly in Python within your DataFrame. To do this, you can run a lambda function against the IP address in your DataFrame.
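A sketch of that lookup, following Google's documented reverse-then-forward DNS verification pattern (the helper and column names here are my own):

```python
import socket

def is_verified_googlebot(ip):
    """Reverse DNS lookup, then forward-confirm the hostname resolves back to the IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
            return False
        # Forward lookup should include the original IP
        return ip in [addr[4][0] for addr in socket.getaddrinfo(host, None)]
    except OSError:
        return False

# Run as a lambda against the IP column of the DataFrame
df["verified_googlebot"] = df["ip"].apply(lambda ip: is_verified_googlebot(ip))
verified_df = df[df["verified_googlebot"]]
```

Because each lookup is a network call, it is worth deduplicating to unique IPs (or caching results) before applying this across millions of rows.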
Alternatively, you can use CrawlerDetect, a third-party library for validating crawlers.
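Assuming the crawlerdetect package from PyPI (a Python port of the PHP CrawlerDetect project), usage might look like the sketch below; verify the API against the version you install. Note that it identifies crawler user agents but does not verify them, so spoofed Googlebot user agents will still pass, which is why it pairs well with the DNS check above.

```python
from crawlerdetect import CrawlerDetect  # pip install crawlerdetect

crawler_detect = CrawlerDetect()

# Flag rows whose user agent string looks like a bot/crawler
df["is_crawler"] = df["user_agent"].apply(lambda ua: crawler_detect.isCrawler(ua or ""))
bot_df = df[df["is_crawler"]]
```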
Pivoting & Plotting
Once you have your data in a manageable form, it's time to analyze. You can take your newly filtered DataFrame and run it through a series of pivots, sketched under each heading below.
Distribution by Status Code
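A sketch of this pivot, assuming the verified_df DataFrame and column names from the snippets above:

```python
import matplotlib.pyplot as plt

# Count crawl events per HTTP status code
status_pivot = (
    verified_df.groupby("status")
    .size()
    .reset_index(name="crawl_count")
    .sort_values("crawl_count", ascending=False)
)

status_pivot.plot(kind="bar", x="status", y="crawl_count", title="Crawls by Status Code")
plt.show()
```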
Top Files Crawled
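Along the same lines, the most frequently crawled URLs (again assuming the earlier column names):

```python
# The 25 most frequently crawled paths
top_files = (
    verified_df.groupby("path")
    .size()
    .reset_index(name="crawl_count")
    .sort_values("crawl_count", ascending=False)
    .head(25)
)
print(top_files)
```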
Crawls by Date
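And a daily trend, assuming the timestamp column was parsed to datetimes earlier:

```python
import matplotlib.pyplot as plt

# Daily crawl counts, using the parsed timestamp column as a datetime index
crawls_by_date = verified_df.set_index("timestamp").resample("D").size()

crawls_by_date.plot(kind="line", title="Crawls per Day")
plt.show()
```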
Conclusion
Log file analysis can often be the separator between theory and the reality of how Google and other search engines understand your sites. Taking the time to scale these types of analyses, so you can understand the correlative effects of SEO changes and experiments on search engine behavior, can quickly give your team the upper hand across all technical SEO programs moving forward. Leveraging Python for log file analysis is the most seamless way to gain quick, continuous insight into your SEO initiatives without relying on manual tool configuration or running into the big data limits that most tools cannot handle.
Want to talk more about log files? Trying to figure out specific problems in your own implementation of Python-based log file analysis? Feel free to reach out to me directly via my LinkedIn or Twitter!
Credit Where Credit is Due
There are plenty of other Python practitioners who have tackled this same problem with their own unique approaches. Here are some of the resources I consulted while building out my log file analysis systems; I would recommend reading them if you are interested in learning more!
- Log File Analysis with Cloudflare and Google Data Studio by Suganthan Mohanadasan
- Leveraging Python and Google Cloud to Extract Meaningful SEO insights from server log data by Charly Wargnier
- How I Analyzed Millions of Log Files with Python by SEO Garden