Hi Huji,
In my day job I'm a network engineer. In practice, nothing more specific
than a /24 gets routed on the internet. I would just take a quick and
dirty approach: ignore the last octet, so cache based on the /24. If you
want something more elaborate you can vary the prefix length. An IPv4
address is 32 bits. A /24 says: the network part is 24 bits and the host
part is 8 bits. So for a /23 it's 23 bits of network and 9 bits of host.
A prefix always starts on a bit boundary, so the last octet of a /24
always runs from 0 (network) to 255 (broadcast). Just Google a bit to
find posts like
https://learningnetwork.cisco.com/blogs/vip-perspectives/2014/05/15/network…
That makes the comparison very easy and very efficient.
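A minimal sketch of the /24 idea in Python, using the standard ipaddress
module. The names whois_cache, cache_key, lookup and fetch_whois are
hypothetical and only for illustration; fetch_whois stands in for whatever
function the bot already uses to query WHOIS.

```python
import ipaddress

# Hypothetical cache: maps a network (e.g. 8.4.4.0/24) to the WHOIS details.
whois_cache = {}


def cache_key(ip, prefix_len=24):
    """Return the network that contains `ip` at the given prefix length."""
    # strict=False accepts a host address and masks off the host bits.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def lookup(ip, fetch_whois):
    """Return cached details for the /24 of `ip`; query WHOIS only on a miss."""
    key = cache_key(ip)
    if key not in whois_cache:
        whois_cache[key] = fetch_whois(ip)
    return whois_cache[key]
```

Because the key is a single hashable value, the lookup is an ordinary
dictionary access and no range comparison is needed. Passing a different
prefix_len (say 23) to cache_key masks off 9 host bits instead of 8.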
How are you going to deal with providers that announce large chunks of
IP space (like a /13) that are used for all sorts of things? I assume
you want to use INET objects and not ROUTE objects? Be aware that mass
harvesting of databases like RIPE isn't allowed. Also, the quality of
these objects differs greatly depending on the LIR/country/RIR.
Maarten
On 12-07-19 04:43, Huji Lee wrote:
Hi all,
I am working on a bot that fetches a list of anonymous editors on
fawiki, uses WHOIS to retrieve more info about that IP, and uses a
number of online APIs to check if the IP is a proxy or not.[1]
I would like to improve the code by implementing a CIDR cache, so that
if I run whois on 8.4.4.8 and determine that its ASN range is
8.4.4.0/24 and then I encounter 8.4.4.9 in the
next iteration of my for loop, I would quickly determine this IP also
belongs to the same range and skip the WHOIS part for it.
The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.255"
(these are the beginning and end IP addresses of the 8.4.4.0/24
range). Obviously, we can convert these IPs to hex
to make comparisons easier. Given an IP like 8.4.4.9, we need the
object to efficiently determine if it already has an IP range that
encompasses this given IP and, if so, return the previously cached
details for that IP pair. If not, we will run WHOIS and store the
result in the cache.
The part that I am not fully clear about is the following: how can I
avoid having to loop through every range in the cache? Is there a way
to create a hash function that checks two inequality comparisons
efficiently?
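One possible way to avoid the linear scan is to use a sorted list with
binary search in place of a hash function. The sketch below assumes the
cached ranges never overlap. The RangeCache class and its method names
are hypothetical; only the standard bisect and ipaddress modules are
used.

```python
import bisect
import ipaddress


class RangeCache:
    """Cache keyed by non-overlapping IP ranges, searched with bisect."""

    def __init__(self):
        self._starts = []   # sorted range start addresses, as integers
        self._entries = []  # parallel list of (end address, details)

    def add(self, cidr, details):
        """Store details for a range given in CIDR form, e.g. '8.4.4.0/24'."""
        net = ipaddress.ip_network(cidr, strict=False)
        start = int(net.network_address)
        end = int(net.broadcast_address)
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (end, details))

    def get(self, ip):
        """Return the details of the range containing ip, or None."""
        addr = int(ipaddress.ip_address(ip))
        # Index of the last range whose start is <= addr.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            end, details = self._entries[i]
            if addr <= end:
                return details
        return None
```

A lookup does one binary search over the start addresses and then a
single comparison against the end of the candidate range, so it takes
O(log n) comparisons for n cached ranges. Adding a range still costs
O(n) because of the list insert.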
Thanks!
Huji
[1]
https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy…
_______________________________________________
pywikibot mailing list
pywikibot(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikibot