Python: comparing two or more lists

Welcome back, readers, I’ve missed you! I’ve been busy with major project upgrades: LARG*ncm is stable and working well via HTTPS! LARG*feed’s current pruned database contains about 500,000 attackers with a 30-day expiration, which amounts to just shy of 200,000 high-confidence IPs, meaning each one attacked us directly, whether through a honeypot, a firewall log, or a DDoS.

I considered adding data from other threat feeds but we don’t currently accept any data directly from a single threat feed. Some feeds are full of addresses that would never get past our whitelists, let alone be high confidence. How do you even know which feeds are excellent and which are not?

Analysing these threat feeds is a valuable skill because you can look at how exceptional feeds work and use that data to improve your own. So how do you figure it out? The quickest way is to check for overlapping entries. If multiple threat feeds say an IP is bad, you can be much more confident that it is bad, so you want to check as many feeds as possible. Just don’t get discouraged: overlaps are statistically rare given the size of the internet.

Using Python to compare two lists is easy:

threatfeed1 = ["1.1.1.1", "3.3.3.3"]
threatfeed2 = ["2.2.2.2", "3.3.3.3"]
set(threatfeed1).intersection(threatfeed2)

The result, {'3.3.3.3'}, is exactly what you expect. This comparison is built right into Python, so it’s a simple and fast solution. It works if you only have two lists, but my goal is to compare N lists:

threatfeed1 = ["1.1.1.1", "3.3.3.3"]
threatfeed2 = ["2.2.2.2", "3.3.3.3"]
threatfeed3 = ["4.4.4.4", "5.5.5.5"]
set(threatfeed1).intersection(threatfeed2, threatfeed3)

Unfortunately, the result is an empty set once I add the third list. I still want 3.3.3.3 in the answer, but intersection() only returns addresses that appear in every feed it is given. I can certainly see the value in that check, but it doesn’t suit my purpose here and, as far as I can tell, no single built-in call reports addresses that appear in at least two feeds, so it’s time to improvise.

threatfeed1 = ["1.1.1.1", "3.3.3.3"]
threatfeed2 = ["2.2.2.2", "3.3.3.3"]
threatfeed3 = ["4.4.4.4", "5.5.5.5"]
threatfeedlists = [threatfeed1, threatfeed2, threatfeed3]
currentlist = []
currentlistindex = 1
threatlist = [] # to store any discovered intersections
counter = 2 # starts at 2 because it will load up 2 lists to start
for listentry in threatfeedlists: # loading up the lists to compare on just the first run
    if currentlistindex == 1:
        currentlist = listentry
        currentlistindex += 1
    else:
        currentlist = listentry
    # First run, load first list, else load the next one
    while counter <= len(threatfeedlists):
        for intersectbadguy in set(currentlist).intersection(threatfeedlists[counter - 1]):
            threatlist.append(intersectbadguy)
        counter += 1
    currentlistindex += 1
    counter = currentlistindex
for badguy in threatlist:
    print(badguy)

The code is fairly simple: on the first pass it compares list 1 with list 2, then list 1 with list 3. Once list 1 has been checked against every other list, the index increases and the code moves on to compare list 2 with list 3. List 3 never needs to look back, since it has already been checked against every other list. The list of feeds can be as long as you please, and you end up with every overlap that exists. Note that an address found on three or more lists is appended once per matching pair, so deduplicate threatlist if you need unique entries.
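For comparison, here is a more compact sketch of the same idea. It is not the code used above: it swaps the pairwise loop for collections.Counter from the standard library and counts how many feeds list each address. The function name find_overlaps and the min_feeds parameter are made up for this illustration.

```python
from collections import Counter


def find_overlaps(feeds, min_feeds=2):
    """Return {ip: number_of_feeds} for IPs listed by at least min_feeds feeds."""
    counts = Counter()
    for feed in feeds:
        # set() drops duplicates inside a single feed, so each feed
        # contributes at most one count per address.
        counts.update(set(feed))
    return {ip: n for ip, n in counts.items() if n >= min_feeds}


threatfeed1 = ["1.1.1.1", "3.3.3.3"]
threatfeed2 = ["2.2.2.2", "3.3.3.3"]
threatfeed3 = ["4.4.4.4", "5.5.5.5"]

overlaps = find_overlaps([threatfeed1, threatfeed2, threatfeed3])
print(overlaps)  # {'3.3.3.3': 2}
```

Each address appears once in the result, and the count doubles as a rough confidence signal: the more feeds agree, the higher the number.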

This first step of analysis tells us that about 5,000 out of more than 100,000 possible entries are on more than one list. Once you have processed all of the threat feed lists, compare your overlap list with your whitelist. I usually see a high percentage of the overlaps on the whitelist, likely because attackers are spoofing IPs to insert bad data into feeds.
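The whitelist check can be sketched roughly like this. It assumes the whitelist is a list of single addresses or CIDR ranges, which may not match how your whitelist is stored. The function name split_by_whitelist and the sample data are hypothetical. Only the standard library ipaddress module is used.

```python
import ipaddress


def split_by_whitelist(overlaps, whitelist):
    """Split overlapping IPs into (whitelisted, remaining) lists."""
    # Assumption: whitelist entries are single addresses or CIDR ranges.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    whitelisted, remaining = [], []
    for ip in overlaps:
        address = ipaddress.ip_address(ip)
        if any(address in network for network in networks):
            whitelisted.append(ip)
        else:
            remaining.append(ip)
    return whitelisted, remaining


# Hypothetical sample data, not real feed or whitelist contents
overlaps = ["3.3.3.3", "8.8.8.8", "203.0.113.7"]
whitelist = ["8.8.8.8", "203.0.113.0/24"]

whitelisted, remaining = split_by_whitelist(overlaps, whitelist)
print(whitelisted)  # ['8.8.8.8', '203.0.113.7']
print(remaining)    # ['3.3.3.3']
```

Keeping the whitelisted overlaps in their own list, rather than discarding them, makes it easy to see what share of the overlaps they account for.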

I’ve learned that attackers are localised and threat feeds cannot discover everything. Overlaps are rare due to the sheer size of the internet, so while each list can have great data, we need to investigate further. One excellent way is to use the AbuseIPDB database. We have a superb rating for reporting attacks against LARG*net and can leverage their API to score attackers. More on this in an upcoming blog!

LARG*net