atechdad make it so

Mass Identification of WordPress Versions

In case you haven’t noticed, I recently migrated my blog to Jekyll. I did this as a response to the meltdowns I experienced the last time I was on the front page of HN. I could have upgraded my server of course, but I’m stubborn. Besides, Jekyll gave me an opportunity to learn something new.

Anyway, while going through the migration exercise, I became curious about what the current WordPress install base looks like. As it often does, general curiosity gave way to brainstorming a way to check. I figured that if I could find a fingerprint for a WordPress install, I could again comb through the scans.io HTTP dumps. So that’s what I did.

To better answer this question, I figured I’d also need to check some historical records and compare them to recent dumps.

I used the following process to collect this data:

  1. Download each HTTP dump from 09/2013 until 08/2015
    • I got a list of the dumps from the scans.io site (https://scans.io/json) and parsed out each sonar.http entry.
    • I downloaded each of them. Since disk space is severely limited on my VPS, I had to comb through the data as it streamed down. This added a lot of time: the overall process took about 10 days to get everything I was looking for. I parsed the data looking for three things:
      1. The date of the scan
      2. The IP of the server. I did this so I could track versions over time by source.
      3. The version.
  2. Parse each dump and look for a marker that identifies the page as a WordPress installation
    • Since most unmodified WordPress installs include the version in a generator meta tag, I got the value by matching the regex “content="WordPress [0-9].[0-9].?[0-9]?”
  3. Export the parsed data into a format that is easy to configure Splunk to process. I opted for JSON.
  4. Send the data to Splunk
    • I chose Splunk to get more exposure to it, since lots of companies use it. I set up a Splunk server for this using their free licensing model, but that is outside the scope of this post.
  5. Run reports on the data
    • cat log | sort | uniq -c works, but I wanted to try something new.
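
The download-and-parse steps above can be sketched roughly as follows in Python. This is a minimal sketch, not the exact script I ran. It assumes each dump is a gzip-compressed file with one JSON record per line, and the field names `ip` and `data` (a base64-encoded response) are assumptions for illustration. The pattern is a slightly stricter variant of the regex above, with the dots escaped so that a two-part version such as 4.2 does not pick up the closing quote.

```python
import base64
import gzip
import json
import re
import urllib.request

# Stricter variant of the regex in the post: dots are escaped.
WP_META = re.compile(r'content="WordPress ([0-9]+\.[0-9]+(?:\.[0-9]+)?)')


def extract_version(html):
    """Return the WordPress version found in html, or None."""
    match = WP_META.search(html)
    return match.group(1) if match else None


def stream_dump(url):
    """Yield raw lines from a gzip dump while it downloads (nothing hits disk)."""
    with urllib.request.urlopen(url) as resp:
        with gzip.GzipFile(fileobj=resp) as gz:
            for raw in gz:
                yield raw


def parse_dump(lines, scan_date):
    """Yield {"date", "ip", "version"} for every record that looks like WordPress.

    Assumed (hypothetical) record layout: {"ip": "...", "data": "<base64 body>"}.
    """
    for line in lines:
        try:
            record = json.loads(line)
            ip = record["ip"]
            body = base64.b64decode(record["data"]).decode("utf-8", errors="replace")
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed or incomplete records
        version = extract_version(body)
        if version:
            yield {"date": scan_date, "ip": ip, "version": version}
```

Writing each yielded dict with `json.dumps` on its own line gives a file with one event per line, which is the kind of JSON export described above.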

Problems with my methods

  • This process doesn’t include HTTPS. I get this, and I realize that I am missing a large chunk of data. There’s still a lot to be analyzed here, so I am good with this for now.
  • This doesn’t include custom configs that have removed the WordPress meta tags. I am making the assumption that this is an exception to the rule.
  • Custom version values in the meta tags can pollute the data. Another aSSumption…

The Results:

Here are some of the reports I’ve run so far:

Count of WordPress installs on individual IPs as of August 2015

Query: index=wp | dedup ip | stats count

Count of WP installs

There appear to be 386,357 installs which fit the criteria above.

Top 20 WordPress versions as of August 2015

Query: index=wp| top limit=20 version

Top WP installs

At the time of the last scan, 4.2.4 was the clear winner. However, I was surprised to see some early 3.x versions in the top 20.
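
For anyone without a Splunk instance, the two queries so far can be approximated in plain Python. This is a rough sketch rather than what Splunk does internally. It assumes the parsed events are dicts with `ip` and `version` keys, and the function names are made up for illustration.

```python
from collections import Counter


def unique_ip_count(events):
    """Rough equivalent of `dedup ip | stats count`: number of distinct IPs."""
    return len({event["ip"] for event in events})


def top_versions(events, limit=20):
    """Rough equivalent of `top limit=N version`.

    Returns (version, count, percent) rows, most common first.
    """
    counts = Counter(event["version"] for event in events)
    total = sum(counts.values())
    return [
        (version, count, round(100.0 * count / total, 2))
        for version, count in counts.most_common(limit)
    ]
```

Like the Splunk `top` command, `top_versions` counts events rather than unique IPs, so a host that shows up in several scans is counted once per scan.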

Top 20 WordPress versions over time (9/1/2013 to 8/1/2015)

Query: index=wp| timechart count by version limit=20 useother=f

Top WP installs over time

What’s interesting about this one is that you can see clear spikes when new versions are released. It also appears that there is generally about a three-month overlap between a release and its major successor’s rise.
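
The timechart can be approximated outside Splunk as well. The sketch below buckets events by month and keeps only the most common versions overall, which is roughly what `limit=20 useother=f` does. It assumes each event is a dict with an ISO-style `date` string (YYYY-MM-DD) and a `version` key. Both the layout and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def counts_by_month(events, limit=20):
    """Return {"YYYY-MM": {version: count}} for the `limit` most common versions."""
    totals = Counter(event["version"] for event in events)
    keep = {version for version, _ in totals.most_common(limit)}

    table = defaultdict(Counter)
    for event in events:
        if event["version"] in keep:  # drop everything else, like useother=f
            month = event["date"][:7]  # "2015-08-01" -> "2015-08"
            table[month][event["version"]] += 1

    # Sort by month so the result reads chronologically.
    return {month: dict(counts) for month, counts in sorted(table.items())}
```

Each inner dict is one column of the chart, so plotting one line per version across the sorted months gives a similar picture.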

Or a pretty picture of the top 30 in the same date range, just for fun (or for execs)…

Query: index=wp| timechart count by version limit=30 useother=f (using the area visualization)

Top 30 WP installs over time

Other ideas for reports:

  • Versions with the least change by IP. This could indicate some kind of canned WordPress site, maybe.
  • Exposure based on versions with known vulnerabilities.
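
Both ideas could start from something like the sketch below. It assumes events are dicts with `date` (ISO string), `ip`, and `version` keys, and that versions are purely numeric dotted strings. The cutoff passed to `count_older_than` is a placeholder that the caller picks. It is not a statement about which releases are actually vulnerable, since that would need a real vulnerability list.

```python
from collections import defaultdict


def version_key(version):
    """Turn "4.2.4" into (4, 2, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def versions_seen_per_ip(events):
    """Return {ip: number of distinct versions observed}; 1 means it never changed."""
    seen = defaultdict(set)
    for event in events:
        seen[event["ip"]].add(event["version"])
    return {ip: len(versions) for ip, versions in seen.items()}


def count_older_than(events, cutoff):
    """Count IPs whose most recently seen version is older than `cutoff`."""
    latest = {}
    for event in sorted(events, key=lambda e: e["date"]):
        latest[event["ip"]] = event["version"]  # later scans overwrite earlier ones
    cut = version_key(cutoff)
    return sum(1 for version in latest.values() if version_key(version) < cut)
```

For example, `count_older_than(events, "4.2.4")` counts the hosts whose last observed version is below 4.2.4, where 4.2.4 is only an illustrative cutoff.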

The reports can go on and on, but you get the idea.

Overall I learned a bit more about Splunk and got a clearer picture of the state of WordPress installs.

If you have any questions, let me know.