
The method of extraction

23 May 2012

In order to make commenting on or improving the method easier, I copied the content here.

Please note that it’s licensed under the X11 license, since I posted it there.
The license.

I don’t think my document matters much in this case. It’s not really documentation, and it’s pretty rough-looking at least.


Platform: Windows 7 // You’ll see why I mention that

Data for relays: Consensuses of May 2008

// I looked into the server descriptors as well, but they contain more data, which I didn’t consider useful for the comparison.

Data for bridges: Statuses of bridges of May 2008

Download consensuses

1. I downloaded the consensuses of May 2008 “consensuses-2008-05.tar.bz2”

2. Unpacked them, which gave me the root folder “consensuses-2008-05” containing 31 folders with 24 files each

Inspect the files // I used Notepad++ since the default notepad has many downsides

3. I opened the file “2008-05-01-00-00-00-consensus”

4. The lines containing the relay names start with an “r” // Could I use that somehow?

5. I considered these lines to be the only useful ones

Process the files // Since manually copying each line to a new file is slow I used grep for Windows. I had it installed already, but it was rather unused.

// Linux is wonderful here, Windows lacks this function.

6. I found a way to extract the lines containing the relay name by using grep; now I needed a pattern

7. All relay lines contain the year 2008, so I could use that as the pattern.

// Using regular expression for the full data might have worked

// Using regular expression for the 27 chars string might have worked

// I used “2008” because it was simple and gave only few false positives

// I tested it with a single file, then the folder “01”

8. Having all relay lines of 24 hours in a single file revealed that at least some of the relays

were not up for the full 24 hours.

9. I decided to use grep on all files at once to have all relay lines in one file.

// grep is able to do so and I didn’t want to miss any relay

// grep is really fast

10. I used “grep -r -h 2008 X:\consensuses-2008-05 > X:\comparison\dump.txt”

// I renamed the file to “relays unsorted uncleaned.txt”
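The grep step above can be sketched as a small Python script (the function name and paths here are illustrative, not from the original workflow):

```python
# A rough Python equivalent of `grep -r -h 2008` over the unpacked
# consensus folders. Function name and layout are illustrative.
from pathlib import Path

def extract_matching_lines(root, pattern="2008"):
    """Collect every line containing `pattern` from all files under `root`."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            for line in path.read_text(encoding="utf-8").splitlines():
                if pattern in line:
                    matches.append(line)
    return matches
```

Like grep with this pattern, it also catches the few false positives (valid-after and friends) that merely contain “2008”.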

Sort the lines // I considered it useful to sort the lines. Windows isn’t able to sort the content of files.

// Since I used Notepad++ for looking into the files I wanted to use it

// for sorting as well. It can’t do that natively, but there’s a plugin.

11. I decided not to strip the leading “r”

// It shouldn’t hurt.

12. I used Notepad++ with the plugin “Column Sort” to sort the lines

// that’s time and memory intensive

13. I saved a copy of the sorted data and removed the valid-after, fresh-until, valid-until

and vote digest lines

// vote digest was included once because that line contained “2008”

// I saved it as “relays sorted cleaned.txt”

// If I had processed the files manually it would have taken far, far longer.

// The tools were a great help so far. Considering that Linux distributions can do this

// by default, it should not be hard to reproduce this.
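The cleanup in step 13 amounts to dropping the few non-relay lines that matched “2008”. A minimal sketch, assuming relay lines are exactly those starting with “r ”:

```python
def keep_relay_lines(lines):
    """Drop false positives such as valid-after, fresh-until, valid-until
    and the stray vote digest line; relay lines start with "r "."""
    return [line for line in lines if line.startswith("r ")]
```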

Try trimming list // For comparing nicknames in the first place it should be much easier

// to see each nickname just once.

// Manually trimming would have worked, but would have consumed too much time.

// To just keep name and fingerprint I wanted to treat it as CSV.

// Just in order to remove data from the files

14. I loaded the sorted copy into a spreadsheet program, but not all lines could be

imported because spreadsheet programs are limited. I therefore had to split the

list first.

// I used LibreOffice 3.5, but Microsoft Excel has a limited number

// of lines as well.

Split the list // Windows is able to split files, but I don’t know how well.

// I used GSplit, because I knew it could split after x occurrences of a pattern. This includes special characters

// like the Line Feed character. So I could make sure to keep

// the lines themselves intact and could choose exactly how many lines

// the files would contain: the first part 1,000,000 lines, the second the rest.

15. I split the file into two parts by using GSplit

// changed or used settings

// “I want to split after the nth occurrence of a specified pattern”

// “Split after the occurrence number”

// “1000000”

// “0x0A” as this is the LF or Line Feed

// Filename “part{num}.txt”

// “Do not add Gsplit tags to piece files”
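The same split can be done without GSplit. This sketch (output file names are illustrative) writes parts of at most 1,000,000 lines each, cutting only at line feeds so no line is broken:

```python
import os

def split_by_lines(src_path, out_dir, lines_per_part=1_000_000):
    """Split `src_path` into out_dir/part1.txt, part2.txt, ... with at
    most `lines_per_part` lines each, mirroring the GSplit settings."""
    part, buf = 1, []

    def flush():
        nonlocal part, buf
        if buf:
            out = os.path.join(out_dir, f"part{part}.txt")
            with open(out, "w", encoding="utf-8") as dst:
                dst.writelines(buf)
            part += 1
            buf = []

    with open(src_path, encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_part:
                flush()
        flush()  # write the remainder
```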

Keep the wanted // I considered nickname and fingerprint to be valuable, because

// the fingerprint makes identification easier.

16. I loaded each part in a spreadsheet application. //Calc from LibreOffice 3.5

17. I used spaces as separator and made sure every column was treated as text

// Treating it as text prevents interpretations of the data

// for example “001” will be turned into “1” as the leading zeros

// will be dropped; treating the data as text prevents this

18. I removed the columns that seemed not to be required and saved each file as CSV

// no commas were added, I ended up with “nickname” “fingerprint”

// separated by a space, with no empty lines in between.
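The spreadsheet column work in steps 16–18 can also be done directly in code. A sketch that keeps only the second and third space-separated fields of an “r” line (nickname and fingerprint):

```python
def nick_and_fingerprint(relay_line):
    """From an "r" line keep only nickname and fingerprint, the two
    columns the comparison needs. Everything stays a string, so a
    value like "001" is never reinterpreted as the number 1."""
    fields = relay_line.split(" ")
    return f"{fields[1]} {fields[2]}"
```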

Trim the list // Now both files contained the nick and the fingerprint, but still multiple times

// I wanted to remove the duplicates.

// I used Notepad++ with the TextFX plugin

19. I loaded both files into Notepad++ and used TextFX to sort them, as it can

output unique lines only.

// In fact TextFX could have done the first sorting as well

// “Sort ascending”

// “Sort outputs only unique lines”

// “Sort line case insensitive”: no difference between Tor, ToR and tor; the fingerprint keeps those lines

// from being dropped as duplicates.

20. I copied both sorted lists into a new file and removed a single line,

because it appeared twice

// It should have been possible to combine both CSV files before sorting, but that’s a matter of memory
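TextFX’s case-insensitive sort with unique output can be approximated like this (a sketch, not the plugin’s actual code); identical lines collapse to one, while Tor/ToR/tor all survive because their fingerprints differ:

```python
def sort_unique(lines):
    """Case-insensitive sort that keeps only one copy of each exact
    line, like TextFX's "unique lines only" sort."""
    seen, result = set(), []
    for line in sorted(lines, key=str.lower):
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result
```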

21. I discarded the changes made to the CSV files // I did not save the changes

22. I saved the new list as “relay names fingerprint.txt”,

which now contains 9469 lines. // strange, there are not so many relays

// there were never that many relays; did I mess up?

// there are relay names that are the same, but have a different fingerprint

// this explains some occurrences

// I noticed that some fingerprints appeared at least twice, but had different nicknames

// I checked the source data and they were not up at the same time.

// I decided to go on, even though it was strange.

Unnamed relays // Before I started I wondered if Unnamed relays would tell me anything.

// I looked at “Unnamed” and counted them; whole word, match case

// It appeared 3390 times

23. I removed “Unnamed” (case sensitive) and saved as

“relay names fingerprint no unnamed.txt”

// I kept UNNAMED and unnamed as well as Unnamed + any addition

// Should I trim the list further?

24. I loaded the file into Calc and removed the fingerprints

// saved as “relay names only.csv”

25. I sorted the file with Notepad++ and kept the unique names

// “unique relay names only sorted.csv”

// I may have lost “Tor, ToR and tor”, but was OK with that

// I was down to 4873 lines

// Back in 2008 there weren’t so many relays

// Should names have changed that often?
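Steps 23–25 (drop the case-sensitive “Unnamed”, keep only the nickname column, deduplicate case-insensitively) can be sketched in one pass; the case-insensitive step deliberately merges Tor/ToR/tor, as above:

```python
def unique_named(nick_fp_lines):
    """From "nickname fingerprint" lines, drop literal "Unnamed"
    entries (UNNAMED/unnamed survive, as in step 23) and return the
    remaining nicknames, lower-cased and deduplicated."""
    names = set()
    for line in nick_fp_lines:
        nick = line.split(" ")[0]
        if nick != "Unnamed":  # case-sensitive, like the original step
            names.add(nick.lower())
    return sorted(names)
```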


26. I downloaded the bridge data

27. “Grep”ed the statuses

// I also used 2008 as the pattern; there were no false positives this time

28. Sorted using Column Sort

// I sorted with Column Sort in the first place to have an overview

// Many lines were exact duplicates

// I think it’s useless to do this

29. Sorted again using TextFX unique lines only, saved it

// bridges sorted.txt with only unique bridges and fingerprints

30. Loaded into a spreadsheet application

// Remember to treat it as text

31. Keep only the bridge names

// that’s the only thing needed

32. I checked “Unnamed” and it didn’t vary at all

33. Sorted unique, removed “Unnamed”

// a final time to make sure I had fewer lines

// “bridges names only unique.csv”

34. Compared them manually (that’s what I agreed to)

// That was more work than I thought, as I saw the bridge list

// The bridge IPs were sanitized but one could tell if they are stable

35. Copied lines I found from “bridges sorted.txt” and “relay sorted cleaned.txt” to “findings.txt”

The files I really worked with are “unique relay names only sorted.csv”, “bridges names only unique.csv”, “bridges sorted.txt” and “relay sorted cleaned.txt”.

I did not know if the other files I created along the way would be useful, so I saved them. So far I haven’t used them.

My approach as I planned it was to look at the bridge names and compare them to the relay names, mainly because there are many more relays.

Would and should my approach be different if there were 50000 bridges?

I’m sure some will call me (something) for not taking a shortcut. I’m sure I could remove or skip a few steps if I knew the right tools. Also, I’m on Windows.

After I did all this, I was quite sure that this can be done with a script. Some experienced user would be better at this.

What I’m looking for is an improvement on how I approached it. There are plans to compare the names of bridges and relays from a recent tarball.

Maybe it’s even possible to use an algorithm that prints out exact matches.
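That exact-match step is essentially a set intersection; a minimal sketch:

```python
def exact_matches(bridge_names, relay_names):
    """Return, sorted, the names that appear in both lists --
    an algorithm that yields only exact matches."""
    return sorted(set(bridge_names) & set(relay_names))
```

Usage would be as simple as: `for name in exact_matches(bridges, relays): print(name)`.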

