Caching strategy

List overview All Threads
Download

newer

older

New stable release "3.0.20190722"

Re: [pywikibot] Caching strategy

Huji Lee

12 Jul 2019 12 Jul '19

2:43 a.m.

Hi all,

I am working on a bot that fetches a list of anonymous editors on fawiki, uses WHOIS to retrieve more info about that IP, and uses a number of online APIs to check if the IP is a proxy or not.[1]

I would like to improve the code by implementing a CIDR cache, so that if I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 and then I encounter 8.4.4.9 in the next iteration of my for loop, I would quickly determine this IP also belongs to the same range and skip the WHOIS part for it.

The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.25" (these are the beginning and end IP addresses of the 8.4.4.0/24 range). Obviously, we can convert these IPs to Hex to make comparisons easier. Given an IP like 8.4.4.9, we need the object to efficiently determine if it already has an IP range that encompasses this given IP and if so, return the previously cached details for that IP pair. If not, we will store that in cache.

The part that I am not fully clear about is the following: how can I avoid having to loop through every range in the cache? Is there a way to create a hash function that checks two inequality comparisons efficiently?

Thanks!

Huji

[1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy....

Attachments:

attachment.htm (text/html — 2.5 KB)

Show replies by date

Lucas Werkmeister

12 Jul 12 Jul

7:05 a.m.

You probably want to use a trie https://en.wikipedia.org/wiki/Trie for this – I found several available Python implementations, but I don’t know what their advantages or disadvantages are, so I’ll just list them in alphabetical order:

* cidr-tree https://github.com/Figglewatts/cidr-trie * py-radix https://github.com/Figglewatts/cidr-trie * pysubnettree https://github.com/zeek/pysubnettree * pytricia https://github.com/jsommers/pytricia

Cheers, Lucas

On 12.07.19 04:43, Huji Lee wrote:

...

Hi all,

I am working on a bot that fetches a list of anonymous editors on fawiki, uses WHOIS to retrieve more info about that IP, and uses a number of online APIs to check if the IP is a proxy or not.[1]

I would like to improve the code by implementing a CIDR cache, so that if I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 http://8.4.4.0/24 and then I encounter 8.4.4.9 in the next iteration of my for loop, I would quickly determine this IP also belongs to the same range and skip the WHOIS part for it.

The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.25" (these are the beginning and end IP addresses of the 8.4.4.0/24 http://8.4.4.0/24 range). Obviously, we can convert these IPs to Hex to make comparisons easier. Given an IP like 8.4.4.9, we need the object to efficiently determine if it already has an IP range that encompasses this given IP and if so, return the previously cached details for that IP pair. If not, we will store that in cache.

The part that I am not fully clear about is the following: how can I avoid having to loop through every range in the cache? Is there a way to create a hash function that checks two inequality comparisons efficiently?

Thanks!

Huji

[1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy....

pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot

Strainu

8:11 p.m.

I'm very curious if you can run at Wikipedia scale with such a trie in memory on a normal computer (e.g. with only tens of GiB of memory). Please let us know if you actually get this into production (or just submit the script for inclusion in the framework, it sounds really useful)

Strainu

Pe vineri, 12 iulie 2019, Lucas Werkmeister mail@lucaswerkmeister.de a scris:

...

You probably want to use a trie https://en.wikipedia.org/wiki/Trie for this – I found several available Python implementations, but I don’t know what their advantages or disadvantages are, so I’ll just list them in alphabetical order:

cidr-tree https://github.com/Figglewatts/cidr-trie

py-radix https://github.com/Figglewatts/cidr-trie

pysubnettree https://github.com/zeek/pysubnettree

pytricia https://github.com/jsommers/pytricia

Cheers, Lucas On 12.07.19 04:43, Huji Lee wrote:

Hi all,

I am working on a bot that fetches a list of anonymous editors on fawiki, uses WHOIS to retrieve more info about that IP, and uses a number of online APIs to check if the IP is a proxy or not.[1]

I would like to improve the code by implementing a CIDR cache, so that if I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 and then I encounter 8.4.4.9 in the next iteration of my for loop, I would quickly determine this IP also belongs to the same range and skip the WHOIS part for it.

The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.25" (these are the beginning and end IP addresses of the 8.4.4.0/24 range). Obviously, we can convert these IPs to Hex to make comparisons easier. Given an IP like 8.4.4.9, we need the object to efficiently determine if it already has an IP range that encompasses this given IP and if so, return the previously cached details for that IP pair. If not, we will store that in cache.

The part that I am not fully clear about is the following: how can I avoid having to loop through every range in the cache? Is there a way to create a hash function that checks two inequality comparisons efficiently?

Thanks!

Huji

[1] https://github.com/PersianWikipedia/fawikibot/ blob/master/HujiBot/findproxy.py

pywikibot mailing listpywikibot@lists.wikimedia.orghttps://lists.wikimedia.org/mailman/listinfo/pywikibot

Maarten Dammers

13 Jul 13 Jul

2:09 p.m.

Hi Huij,

In my day job I'm a network engineer. Nothing smaller than a /24 gets routed on the internet. I would just do a quick and dirty approach: Ignore the last octet. So cache based on /24. If you want to go more complicated you can loose the length. An ipv4 address is 32 bit. A /24 says: Network is 24 bits and the host part is 8 bits. So for a /23 it's 23 bits of network and 9 bits of host. It's always on the bit boundary so a /24 is alway from 0 (network) to 255 (broadcast). Just Google a bit to find posts like https://learningnetwork.cisco.com/blogs/vip-perspectives/2014/05/15/network-... . So comparison is very easy and very efficient.

How are you going to deal with providers that announce large chunks of ip space (like a /13) that are used for all sorts of things? I assume you want to use INET objects and not ROUTE objects? Be aware that mass harvesting of databases like RIPE isn't allowed. Also the quality of these objects differ greatly depending on the LIR/country/RIR.

Maarten

On 12-07-19 04:43, Huji Lee wrote:

...

Hi all,

I am working on a bot that fetches a list of anonymous editors on fawiki, uses WHOIS to retrieve more info about that IP, and uses a number of online APIs to check if the IP is a proxy or not.[1]

I would like to improve the code by implementing a CIDR cache, so that if I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 http://8.4.4.0/24 and then I encounter 8.4.4.9 in the next iteration of my for loop, I would quickly determine this IP also belongs to the same range and skip the WHOIS part for it.

The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.25" (these are the beginning and end IP addresses of the 8.4.4.0/24 http://8.4.4.0/24 range). Obviously, we can convert these IPs to Hex to make comparisons easier. Given an IP like 8.4.4.9, we need the object to efficiently determine if it already has an IP range that encompasses this given IP and if so, return the previously cached details for that IP pair. If not, we will store that in cache.

The part that I am not fully clear about is the following: how can I avoid having to loop through every range in the cache? Is there a way to create a hash function that checks two inequality comparisons efficiently?

Thanks!

Huji

[1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy....

pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot

Strainu

15 Jul 15 Jul

12:42 p.m.

Pe sâmbătă, 13 iulie 2019, Maarten Dammers maarten@mdammers.nl a scris:

...

Hi Huij,

In my day job I'm a network engineer. Nothing smaller than a /24 gets routed on the internet. I would just do a quick and dirty approach: Ignore the last octet. So cache based on /24. If you want to go more complicated you can loose the length. An ipv4 address is 32 bit. A /24 says: Network is 24 bits and the host part is 8 bits. So for a /23 it's 23 bits of network and 9 bits of host. It's always on the bit boundary so a /24 is alway from 0 (network) to 255 (broadcast). Just Google a bit to find posts like https://learningnetwork.cisco.com/blogs/vip-perspectives/ 2014/05/15/network-binary-math-explained . So comparison is very easy and very efficient.

IPv4 is easy, you can just go with a bit map and be done with it on a decent pc. The problems are IPv6 and usage fragmentation, as described below.

...

How are you going to deal with providers that announce large chunks of ip space (like a /13) that are used for all sorts of things? I assume you want to use INET objects and not ROUTE objects? Be aware that mass harvesting of databases like RIPE isn't allowed. Also the quality of these objects differ greatly depending on the LIR/country/RIR.

I suspect the other apis used in the script are going to split these networks a lot, thus my concern with running a trie at Wikipedia scale. Maybe there's a way to split the ipv6 space just enough to be feasible to use a bitmap as well?

...

Maarten On 12-07-19 04:43, Huji Lee wrote:

Hi all,

I am working on a bot that fetches a list of anonymous editors on fawiki, uses WHOIS to retrieve more info about that IP, and uses a number of online APIs to check if the IP is a proxy or not.[1]

I would like to improve the code by implementing a CIDR cache, so that if I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 and then I encounter 8.4.4.9 in the next iteration of my for loop, I would quickly determine this IP also belongs to the same range and skip the WHOIS part for it.

The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.25" (these are the beginning and end IP addresses of the 8.4.4.0/24 range). Obviously, we can convert these IPs to Hex to make comparisons easier. Given an IP like 8.4.4.9, we need the object to efficiently determine if it already has an IP range that encompasses this given IP and if so, return the previously cached details for that IP pair. If not, we will store that in cache.

The part that I am not fully clear about is the following: how can I avoid having to loop through every range in the cache? Is there a way to create a hash function that checks two inequality comparisons efficiently?

Thanks!

Huji

[1] https://github.com/PersianWikipedia/fawikibot/ blob/master/HujiBot/findproxy.py

pywikibot mailing listpywikibot@lists.wikimedia.orghttps://lists.wikimedia.org/mailman/listinfo/pywikibot

1968

Age (days ago)

1971

Last active (days ago)

pywikibot@lists.wikimedia.org

4 comments

4 participants

tags (0)

participants (4)

Huji Lee
Lucas Werkmeister
Maarten Dammers
Strainu