Progress of comparison

17. May 2012

I started before the bridge data was released, to see how I would obtain the data.

Considering that I hadn’t worked with this kind of data before and have no experience in extracting data from a larger pool, I found my “solutions” at least good enough. After I finished processing the relay data, I came to the conclusion that someone with better tools would be able to do that a lot faster. I guess someone with scripting language skills could do the extraction much faster and therefore more efficiently.

Yesterday the bridge data was released. You can see that on the mailing list.

I downloaded the data yesterday, but did not touch it. Today was a holiday, so I started unpacking, inspecting and finally processing it. I’m not good at time tracking and don’t know how much time I spent on processing the data. I would estimate the overall time I spent on relay and bridge data at four to five hours. The bridge data could be processed faster because there was less of it and I knew what I had to do. I guess it would be faster if I re-did the whole processing, since I now know what I did. Maybe it can be optimized one way or the other. Ideally with a script or something.

The really time-consuming part should be the comparison itself. This may be a spoiler, but from what I have seen so far, most similarities could have been found by a simple tool or script. At least, that’s what I guess.

The comparison, as far as I’m done with it, was pretty easy. I took list A and compared it to list B for exact matches or close names. Copying the data into a new file was more time-consuming than I would have thought. I’m going to revisit the file and see if I have to improve my findings and/or how I display them. From how it looks in its current state, I’m not going to find many more similarities. Still, I was surprised that relays and bridges had names that matched exactly. However, some of them appear to be just tests to see how a bridge performs.
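
For illustration, a comparison like this could probably be scripted roughly as follows. This is only a sketch, not the process described in this post: the function name compare_names, the 0.8 similarity threshold and the example names are all assumptions, and “close names” is interpreted here as a character-level similarity ratio from Python’s difflib module, compared case-insensitively.

```python
from difflib import SequenceMatcher


def compare_names(list_a, list_b, threshold=0.8):
    """Return (exact, close) name matches between list A and list B.

    threshold is an assumed, arbitrary cutoff for what counts as "close".
    """
    set_b = set(list_b)
    exact = []
    close = []
    for name_a in list_a:
        # Exact match: the same string appears in both lists.
        if name_a in set_b:
            exact.append((name_a, name_a))
            continue
        # Close match: similarity ratio of the lowercased names.
        for name_b in list_b:
            ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            if ratio >= threshold:
                close.append((name_a, name_b, round(ratio, 2)))
    return exact, close


# Made-up example names, not taken from the real relay or bridge data.
relays = ["ExampleRelay", "bluemoon", "Unnamed"]
bridges = ["examplerelay", "bluemoon2", "Unnamed"]

exact, close = compare_names(relays, bridges)
print("exact:", exact)
print("close:", close)
```

Every name in list A is compared against every name in list B, which is slow for very large lists but simple to follow. Raising or lowering the threshold changes how strict “close” is, so the candidates would still need a look by hand.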

I’m not fully convinced that the intermediate data I created from step to step will be all that useful for anyone. My initial plan was to make the process public and put all the data somewhere, so that others could check if I did something very wrong. My findings will be published on the mailing list, and once they have looked into them, they will report how many, at least as a percentage, of the names I considered to be similar have IP addresses that are close to each other. Additional data may follow, but I’m not sure yet.

When you see what I did and how… I don’t know. I’m glad that I had ideas on how to do this and that, but some of you may laugh or start crying. Once again, I assume that a script would do all of this much faster.


