I'm very curious if you can run at Wikipedia scale with such a trie in
memory on a normal computer (e.g. with only tens of GiB of memory). Please
let us know if you actually get this into production (or just submit the
script for inclusion in the framework; it sounds really useful).
Strainu
On Friday, 12 July 2019, Lucas Werkmeister <mail(a)lucaswerkmeister.de>
wrote:
You probably want to use a trie
<https://en.wikipedia.org/wiki/Trie> for
this – I found several available Python implementations, but I don’t know
what their advantages or disadvantages are, so I’ll just list them in
alphabetical order:
- cidr-trie <https://github.com/Figglewatts/cidr-trie>
- py-radix <https://github.com/mjschultz/py-radix>
- pysubnettree <https://github.com/zeek/pysubnettree>
- pytricia <https://github.com/jsommers/pytricia>
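To illustrate the idea independently of any of the libraries above, here is a minimal sketch of a binary trie doing longest-prefix matching. It assumes IPv4 only and uses just the standard-library ipaddress module; the class name CidrTrie and its insert/lookup methods are made up for this example and are not the API of any listed package.

```python
import ipaddress


class CidrTrie:
    """Binary trie keyed on the leading bits of an IPv4 network (hypothetical helper)."""

    def __init__(self):
        # Each node is a list: [child_for_bit_0, child_for_bit_1, value]
        self.root = [None, None, None]

    def insert(self, cidr, value):
        """Store value under the given CIDR block, e.g. "8.4.4.0/24"."""
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        # Walk (and create) one node per prefix bit, most significant first.
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node[bit] is None:
                node[bit] = [None, None, None]
            node = node[bit]
        node[2] = value

    def lookup(self, ip):
        """Return the value of the longest stored prefix containing ip, or None."""
        bits = int(ipaddress.ip_address(ip))
        node = self.root
        best = node[2]
        for i in range(32):
            node = node[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node[2] is not None:
                best = node[2]
        return best


trie = CidrTrie()
trie.insert("8.4.4.0/24", {"asn_range": "8.4.4.0/24"})
print(trie.lookup("8.4.4.9"))   # cached details for the /24
print(trie.lookup("8.4.5.1"))   # no covering range stored
```

A lookup touches at most 32 nodes regardless of how many ranges are stored, which is what makes the structure attractive here; memory use per stored range is the open question for large caches.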
Cheers,
Lucas
On 12.07.19 04:43, Huji Lee wrote:
Hi all,
I am working on a bot that fetches a list of anonymous editors on fawiki,
uses WHOIS to retrieve more info about that IP, and uses a number of online
APIs to check if the IP is a proxy or not.[1]
I would like to improve the code by implementing a CIDR cache, so that if
I run whois on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 and
then I encounter 8.4.4.9 in the next iteration of my for loop, I would
quickly determine this IP also belongs to the same range and skip the WHOIS
part for it.
The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.255"
(these are the beginning and end IP addresses of the 8.4.4.0/24 range).
Obviously, we can convert these IPs to Hex to make comparisons easier.
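As a small sketch of that conversion, assuming IPv4 and the standard-library ipaddress module, the first and last address of a range can be turned into plain integers so that membership becomes two integer comparisons (the variable names are only illustrative):

```python
import ipaddress

net = ipaddress.ip_network("8.4.4.0/24")
first = int(net.network_address)     # integer form of 8.4.4.0
last = int(net.broadcast_address)    # integer form of 8.4.4.255

ip = int(ipaddress.ip_address("8.4.4.9"))
in_range = first <= ip <= last

print(hex(first), hex(last), in_range)
```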
Given an IP like 8.4.4.9, we need the object to efficiently determine if it
already has an IP range that encompasses this given IP and if so, return
the previously cached details for that range. If not, we will store the new
range in the cache.
The part that I am not fully clear about is the following: how can I avoid
having to loop through every range in the cache? Is there a way to create a
hash function that checks two inequality comparisons efficiently?
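One possible way to avoid the full loop without a hash function is a sorted list plus binary search. This is a sketch only, and it assumes the cached ranges never overlap: keep the start addresses sorted, find the last start that is not greater than the IP with bisect, and then make a single comparison against that range's end. The class name RangeCache and its add/get methods are hypothetical.

```python
import bisect
import ipaddress


class RangeCache:
    """Cache of non-overlapping IPv4 ranges, searched by bisection (hypothetical helper)."""

    def __init__(self):
        self._starts = []    # sorted first addresses, as integers
        self._entries = []   # parallel list of (last_address, details)

    def add(self, cidr, details):
        net = ipaddress.ip_network(cidr)
        first = int(net.network_address)
        last = int(net.broadcast_address)
        pos = bisect.bisect_left(self._starts, first)
        self._starts.insert(pos, first)
        self._entries.insert(pos, (last, details))

    def get(self, ip):
        value = int(ipaddress.ip_address(ip))
        # Index of the last range whose start is <= value, or -1 if none.
        pos = bisect.bisect_right(self._starts, value) - 1
        if pos < 0:
            return None
        last, details = self._entries[pos]
        return details if value <= last else None


cache = RangeCache()
cache.add("8.4.4.0/24", "details for 8.4.4.0/24")
print(cache.get("8.4.4.9"))   # hit: skip WHOIS
print(cache.get("8.4.5.0"))   # miss: run WHOIS, then add the new range
```

Each lookup is a binary search over the stored ranges rather than a scan. Inserting into a Python list is still linear in the worst case, so this trades cheap lookups for somewhat costlier additions.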
Thanks!
Huji
[1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy.py
_______________________________________________
pywikibot mailing list
pywikibot@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikibot