Hey everybody, Scratch my idea to create a MAS, or webcrawler to automate the search. Searching manually is faster. Using the list of local authority websites I found 11 "bugs" in about 5 minutes, just going through the 'A' part.
http://www.aalten.nl/index.php?simaction=content&mediumid=1&pagid=15...
http://www.albrandswaard.nl/index.php?simaction=content&mediumid=1&p...
http://www.alkmaar.nl/eCache/23220/Aanvraagformulieren
http://www.almere.nl/dienstverlening/logo
http://www.alphenaandenrijn.nl/Smartsite.shtml?id=12248
http://www.ameland.nl/index.php?simaction=content&mediumid=1&pagid=5...
http://www.amersfoort.nl/smartsite.shtml?id=189546
http://www.amsterdam.nl/jeugd_onderwijs/onderwijsbeleid/publicaties/publicat...
http://www.annapaulowna.nl/index.php?mediumid=1&pagid=196&simaction=...
http://www.appingedam.nl/index.php?simaction=content&mediumid=1&pagi...
http://www.arnhem.nl/content.jsp?objectid=113726
Maybe we can split the ODS up alphabetically into manageable chunks and make it a distributed humanoid search system? Who's in?
Cheerio, Jelle
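The alphabetical split proposed above could be sketched roughly as follows. This is only an illustration: the function name chunk_by_letter is made up here, the input is assumed to be a plain list of URLs exported from the spreadsheet, and www.example.org is a placeholder that is not on the real list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def chunk_by_letter(urls):
    """Group URLs by the first letter of their host name, ignoring 'www.'."""
    chunks = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip entries without a host part
            chunks[host[0]].append(url)
    return dict(chunks)


sample = [
    "http://www.almere.nl/dienstverlening/logo",
    "http://www.arnhem.nl/content.jsp?objectid=113726",
    "http://www.alkmaar.nl/eCache/23220/Aanvraagformulieren",
    "http://www.example.org/",  # placeholder entry, not from the real list
]
chunks = chunk_by_letter(sample)
```

Each volunteer could then take one key of the resulting dictionary.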
On 07/13/2011 12:36 PM, Jelle Hermsen wrote:
I still have a pretty big database of government website urls lying around somewhere, so I might be able to automate some of the "hunt":) I'm very interested.
It's attached to this e-mail. If I have some time I might quickly whip up a tiny multi agent system to automatically search the websites.
* Jelle Hermsen jelle@fsfe.org [2011-07-13 13:28:26 +0200]:
Scratch my idea to create a MAS, or webcrawler to automate the search. Searching manually is faster.
I don't know if the attached script might help with that. Would be interested in your view.
Regards, Matthias
Thanks Matthias!
Yep, that script would definitely do the job. One problem is that it uses the Google search API, and that bugger has a bit of a nasty EULA, which only allows you to do 100 or so automated queries per day. That's not really a big problem in this case, but since we'll also need to find the information so we can contact the specific local government in question, there's already a bit of manual labor involved. Adding a manual search using "site:" doesn't make the manual part that much bigger.
I do however see that this might be perceived as a "brain-dead" job. Doesn't really worry me much. I have worked as a code-monkey for 2 years at a firm, banging out Typo3 and Joomla sites round the clock, so "brain-dead" doesn't really bother me much :-) But anyone who wants to chip in with half a brain can use this script. I tried it and it works pretty well. It gives an error when it doesn't find anything, but I guess the programmer was in a bit of a pessimistic mood when he made it ;)
It needs the simplejson module. To install this on Debian Squeeze you can use:
sudo aptitude install python-simplejson
You can run it using (I know Almere's website has Adobe Reader advertisements on it):
python find-acrobat-commercial.py almere.nl
Cheerio, Jelle
Hi Jelle,
just a short reply:
* Jelle Hermsen jelle@fsfe.org [2011-07-15 14:50:54 +0200]:
I do however see that this might be perceived as a "brain-dead" job.
I think most important is that we fix the http://fsfe.org/campaigns/pdfreaders/buglist.en.html and then continue with others whenever we have fun doing so (e.g. at a bug fixing party with drinks and pizza :) ).
All the best, Matthias
I think most important is that we fix the http://fsfe.org/campaigns/pdfreaders/buglist.en.html and then continue with others whenever we have fun doing so (e.g. at a bug fixing party with drinks and pizza :) )
Yeah, you're probably right. I have been known for jumping the gun :)
Matthias Kirschner wrote:
- Jelle Hermsen jelle@fsfe.org [2011-07-13 13:28:26 +0200]:
Scratch my idea to create a MAS, or webcrawler to automate the search. Searching manually is faster.
I don't know if the attached script might help with that. Would be interested in your view.
Note that Adobe Acrobat Reader was renamed to Adobe Reader many years ago, so it would be useful to include a search pattern for it.
Hi Sam,
* Sam Geeraerts samgee@fsfe.org [2011-07-19 17:22:18 +0200]:
Scratch my idea to create a MAS, or webcrawler to automate the search. Searching manually is faster.
I don't know if the attached script might help with that. Would be interested in your view.
Note that Adobe Acrobat Reader was renamed to Adobe Reader many years ago, so it would be useful to include a search pattern for it.
Thanks for the notice. I CCed Ole, who wrote the script.
Regards, Matthias
On Wed, Jul 20, 2011 at 9:08 AM, Matthias Kirschner mk@fsfe.org wrote:
Hi Sam,
- Sam Geeraerts samgee@fsfe.org [2011-07-19 17:22:18 +0200]:
Note that Adobe Acrobat Reader was renamed to Adobe Reader many years ago, so it would be useful to include a search pattern for it.
Thanks for the notice. I CCed Ole, who wrote the script.
According to http://en.wikipedia.org/wiki/Adobe_Acrobat the names have been Adobe Reader and Acrobat Reader. Searching for "Reader" is probably not a good idea, but doing both a search for 'Adobe' and a search for 'Acrobat' would probably make sense.
I just tested searching for Adobe and that gives a lot of false positives (Adobe Flash), however, if I instead search for "Adobe Reader" that works better.
Should we put the script on a public git repository so we can maintain it together?
/Ole
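The search-term reasoning in this message can be illustrated with a small text filter. This is a sketch only, not the code of the script under discussion: the names READER_PATTERN and mentions_reader are invented for the example, and it assumes the page text is already available as a string.

```python
import re

# "Adobe" alone is deliberately left out: it would also match Adobe Flash.
# "Reader" alone is left out too, because it is far too generic.
READER_PATTERN = re.compile(r"\b(?:adobe\s+reader|acrobat)\b", re.IGNORECASE)


def mentions_reader(text):
    """Return True if the text names Adobe Reader or Acrobat."""
    return READER_PATTERN.search(text) is not None
```

With this pattern, "Download Adobe Reader" and "Get Acrobat Reader" both match, while "Requires Adobe Flash" does not.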
* Ole Tange tange@gnu.org [2011-07-20 11:05:44 +0200]:
Should we put the script on a public git repository so we can maintain it together?
That sounds good.
Thanks a lot, Matthias
Hi Ole,
* Matthias Kirschner mk@fsfe.org [2011-07-20 13:47:53 +0200]:
- Ole Tange tange@gnu.org [2011-07-20 11:05:44 +0200]:
Should we put the script on a public git repository so we can maintain it together?
That sounds good.
Is it possible to generate an output text file with:
- Domain name, one or two example URLs of the advertisement
With this format it is easier for others to follow-up.
Regards, Matthias
On Thu, Jul 21, 2011 at 10:07 AM, Matthias Kirschner mk@fsfe.org wrote:
- Matthias Kirschner mk@fsfe.org [2011-07-20 13:47:53 +0200]:
- Ole Tange tange@gnu.org [2011-07-20 11:05:44 +0200]:
Should we put the script on a public git repository so we can maintain it together?
That sounds good.
I have requested a git account at Savannah, but apparently they need to manually approve it - so it may take some time.
Is it possible to generate an output text file with:
- Domain name, one or two example URLs of the advertisement
With this format it is easier for others to follow-up.
This version outputs:
domain name \t Whether it is only in Google's cache or still exists \t URL \t title of page
It imports directly into LibreOffice as TSV, just name the output file foo.csv.
I have run it on um.dk which is attached.
/Ole
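For readers without the attachment, the four-column layout described in this message could be produced along these lines. It is a sketch under assumptions: the helper name write_hits, the status label "exists", and the example URL and title are all hypothetical and do not come from the actual script or its output.

```python
import csv
import io


def write_hits(rows, out):
    """Write (domain, status, url, title) tuples as tab-separated lines."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for domain, status, url, title in rows:
        writer.writerow([domain, status, url, title])


buffer = io.StringIO()
write_hits(
    [
        # Hypothetical row; the status wording is an assumption.
        ("um.dk", "exists", "http://um.dk/hypothetical-page", "Hypothetical title"),
    ],
    buffer,
)
tsv_text = buffer.getvalue()
```

Writing to a real file instead of a StringIO buffer, with a name ending in .csv, would give something that can be opened as described above.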
Ole Tange wrote:
This version outputs:
domain name \t Whether it is only in Google's cache or still exists \t URL \t title of page
It imports directly into LibreOffice as TSV, just name the output file foo.csv.
I have run it on um.dk which is attached.
I suppose there's no way around the limited number of queries? I don't have a Google account and I don't plan on creating one. But 100 queries is probably enough most of the time. Can we undo needing double the amount of queries by using an OR clause in the query?
On Thu, Jul 21, 2011 at 10:08 PM, Sam Geeraerts samgee@fsfe.org wrote:
I suppose there's no way around the limited number of queries?
It is possible to raise the limit by paying.
Maybe Seeks can be used instead of Google? I, however, have no experience in programming stuff against Seeks.
I don't have a Google account and I don't plan on creating one. But 100 queries is probably enough most of the time. Can we undo needing double the amount of queries by using an OR clause in the query?
I could not get the OR clause to work.
However, the script now pauses, so if you run the script in serial (i.e. not in parallel) then it should stay below the 100/day. So leave it running for a few days to slowly work its way through the domains.
/Ole
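The pausing idea can be sketched as below. This is not the script's actual code: the function names are invented, one query per domain is assumed, and the limit of exactly 100 per day is taken from the rough figure mentioned in this thread.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def pause_between_queries(queries_per_day=100):
    """Seconds to wait between queries to stay within a daily limit."""
    return SECONDS_PER_DAY / queries_per_day


def run_serially(domains, search, sleep=time.sleep, queries_per_day=100):
    """Call search(domain) for each domain, one at a time, pausing in between.

    'search' is any callable supplied by the caller; 'sleep' can be
    replaced with a stub to try the loop without actually waiting.
    """
    delay = pause_between_queries(queries_per_day)
    results = {}
    for index, domain in enumerate(domains):
        if index > 0:
            sleep(delay)  # no pause is needed before the first query
        results[domain] = search(domain)
    return results
```

At 100 queries per day the pause works out to 864 seconds, a little over 14 minutes per domain.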
Ole Tange wrote:
Maybe Seeks can be used instead of Google? I, however, have no experience in programming stuff against Seeks.
Me neither.
I could not get the OR clause to work.
My test with 'site:'+domain+' acrobat OR "adobe reader" -filetype:pdf' seemed to work.
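Wrapped in a function, the combined query from the test above might look like this. The function name build_query is made up for the example, and the URL-encoding step is an assumption about how the query would be passed to a search API.

```python
from urllib.parse import quote_plus


def build_query(domain):
    """Build one query that covers both names instead of two separate ones."""
    return 'site:' + domain + ' acrobat OR "adobe reader" -filetype:pdf'


query = build_query("almere.nl")
# URL-encode the query before putting it into a request URL.
encoded = quote_plus(query)
```

This keeps the query count at one per domain, which matters with a daily limit.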
However, the script now pauses, so if you run the script in serial (i.e. not in parallel) then it should stay below the 100/day. So leave it running for a few days to slowly work its way through the domains.
I think it's useful to limit the number of search results per requested domain. We only really want to know if the website still has at least one issue and maybe just get a few examples to point out to the website maintainers when we contact them.
On Sun, Jul 24, 2011 at 1:58 PM, Sam Geeraerts samgee@fsfe.org wrote:
Ole Tange wrote:
Maybe Seeks can be used instead of Google? I, however, have no experience in programming stuff against Seeks.
Me neither.
I just asked the Seeks people: Seeks is a meta-frontend so it should work.
I could not get the OR clause to work.
My test with 'site:'+domain+' acrobat OR "adobe reader" -filetype:pdf' seemed to work.
Great.
However, the script now pauses, so if you run the script in serial (i.e. not in parallel) then it should stay below the 100/day. So leave it running for a few days to slowly work its way through the domains.
I think it's useful to limit the number of search results per requested domain. We only really want to know if the website still has at least one issue and maybe just get a few examples to point out to the website maintainers when we contact them.
That is not a good service: If they fix the problem for the 2 examples and for all future pages, we will still be getting back to them next time we do this. I think it is a much better service to help them find what pages have the problem - we are most likely better at doing that than they are anyway.
/Ole
Hi Ole,
* Ole Tange tange@gnu.org [2011-07-24 21:50:09 +0200]:
That is not a good service: If they fix the problem for the 2 examples and for all future pages, we will still be getting back to them next time we do this. I think it is a much better service to help them find what pages have the problem - we are most likely better at doing that than they are anyway.
Sounds reasonable, especially if we send them e-mails, where we can attach all the URLs.
Thanks, Matthias
Ole Tange wrote:
On Sun, Jul 24, 2011 at 1:58 PM, Sam Geeraerts samgee@fsfe.org wrote:
I just asked the Seeks people: Seeks is a meta-frontend so it should work.
There seems to be an API [1].
I think it's useful to limit the number of search results per requested domain. We only really want to know if the website still has at least one issue and maybe just get a few examples to point out to the website maintainers when we contact them.
That is not a good service: If they fix the problem for the 2 examples and for all future pages, we will still be getting back to them next time we do this. I think it is a much better service to help them find what pages have the problem - we are most likely better at doing that than they are anyway.
I was assuming that they'd realize that their website can have multiple references or that the reference would be handled by one component of a CMS. Your reasoning makes sense and is safer.
[1] http://seeks-project.info/wiki/index.php/Seeks_JSON_Search_API
On Wed, Jul 20, 2011 at 11:05 AM, Ole Tange tange@gnu.org wrote:
On Wed, Jul 20, 2011 at 9:08 AM, Matthias Kirschner mk@fsfe.org wrote:
Should we put the script on a public git repository so we can maintain it together?
git clone git://git.savannah.nongnu.org/pdfcom.git
I found both Sam and Matthias in Savannah, so you are added as members.
/Ole
Ole Tange wrote:
git clone git://git.savannah.nongnu.org/pdfcom.git
I found both Sam and Matthias in Savannah, so you are added as members.
Damn, I've been able to avoid learning git until now. I guess I'll have to look into it. :)
By the way, I should have mentioned this earlier, but the name "PDF commercial identifier" is a bit unfortunate. "Commercial" is probably meant in the sense of "advertisement", but it gives the impression that it looks for "commercial software". That suggests that only non-free software can be commercial, which is of course not true.
P.S.: I thought it was about time I subscribe to this list, but I can't find any information webpage about it. [1] or [2] would be a logical place to find something like that, IMO.
[1] http://fsfe.org/campaigns/pdfreaders/pdfreaders.en.html [2] http://lists.fsfe.org/
* Sam Geeraerts samgee@fsfe.org [2011-07-25 13:47:02 +0200]:
Ole Tange wrote:
git clone git://git.savannah.nongnu.org/pdfcom.git
I found both Sam and Matthias in Savannah, so you are added as members.
Damn, I've been able to avoid learning git until now. I guess I'll have to look into it. :)
Sorry, we are forcing you to learn new stuff to participate ;)
By the way, I should have mentioned this earlier, but the name "PDF commercial identifier" is a bit unfortunate. "Commercial" is probably meant in the sense of "advertisement", but it gives the impression that it looks for "commercial software". That suggests that only non-free software can be commercial, which is of course not true.
That's true. Ole, is it OK if we brainstorm about a cool name? Things like GATA: "Government Ads Terminator Assistant" come to my mind ;)
P.S.: I thought it was about time I subscribe to this list, but I can't find any information webpage about it. [1] or [2] would be a logical place to find something like that, IMO.
That's because at the beginning this was mainly a list for internal coordination. As you want to be active here, please subscribe under: https://lists.fsfe.org/mailman/listinfo/pdfreaders
All the best, Matthias
On Mon, Jul 25, 2011 at 2:13 PM, Matthias Kirschner mk@fsfe.org wrote:
- Sam Geeraerts samgee@fsfe.org [2011-07-25 13:47:02 +0200]:
Ole Tange wrote:
git clone git://git.savannah.nongnu.org/pdfcom.git
I found both Sam and Matthias in Savannah, so you are added as members.
Damn, I've been able to avoid learning git until now. I guess I'll have to look into it. :)
Do initial checkout:
$ git clone yourlogin@git.sv.gnu.org:/srv/git/pdfcom.git
Daily work:
$ git pull
<<do your changes>>
(If others are busy checking in too, then get their changes:
$ git pull
)
$ git commit -a
$ git push
The only hard part I found coming from other VCSs is that 'commit' does not send the changes to the server: You also have to push.
By the way, I should have mentioned this earlier, but the name "PDF commercial identifier" is a bit unfortunate. "Commercial" is probably meant in the sense of "advertisement", but it gives the impression that it looks for "commercial software". That suggests that only non-free software can be commercial, which is of course not true.
You are right: "PDF advertisement finder" would clearly be less ambiguous.
That's true. Ole, is it OK if we brainstorm about a cool name? Things like GATA: "Government Ads Terminator Assistant" come to my mind ;)
Sure. If I were you I would choose a name so that if people knew the function but not the name then they would be able to find it using Google. So PDF should probably be part of the name.
As I do bioinformatics and we work with DNA then the letters A, C, G, and T take on a whole new meaning - thus words containing only A, C, G, and T are always assumed to be DNA related.
/Ole
Ole Tange wrote:
Do initial checkout:
$ git clone yourlogin@git.sv.gnu.org:/srv/git/pdfcom.git
Daily work:
$ git pull
<<do your changes>>
(If others are busy checking in too, then get their changes:
$ git pull
)
$ git commit -a
$ git push
The only hard part I found coming from other VCSs is that 'commit' does not send the changes to the server: You also have to push.
Cool. So, much like the bzr I'm more used to.
Sure. If I were you I would choose a name so that if people knew the function but not the name then they would be able to find it using Google. So PDF should probably be part of the name.
Makes sense.
Matthias Kirschner wrote:
That's true. Ole, is it OK if we brainstorm about a cool name? Things like GATA: "Government Ads Terminator Assistant" come to my mind ;)
Yay, brainstorming! Let me go crazy for a bit:
- PRONTO: PDF Readers Ought Not To Offend
- PReSTo: PDF Reader Search Tool
- PROSIT: PDF Reader On Site Investigation Tool
- PROBE: PdfReaders.Org {Baring,Batch,Bettering,Bias-finding} Equipment, PdfReaders.Org Bug Evincer
- PDFreeder
That's because at the beginning this was mainly a list for internal coordination. As you want to be active here, please subscribe under: https://lists.fsfe.org/mailman/listinfo/pdfreaders
Subscription request pending.
* Sam Geeraerts samgee@fsfe.org [2011-07-25 20:29:36 +0200]:
That's because at the beginning this was mainly a list for internal coordination. As you want to be active here, please subscribe under: https://lists.fsfe.org/mailman/listinfo/pdfreaders
Subscription request pending.
I just approved your request. Can you write a firm introduction for the rest here?
Thanks, Matthias
* Matthias Kirschner mk@fsfe.org [2011-07-26 10:03:50 +0200]:
I just approved your request. Can you write a firm introduction for the rest here?
Do you also want to have access to the pdfreaders.org website? Then you could help to improve things there when we get requests.
Thanks, Matthias
* Sam Geeraerts samgee@fsfe.org [2011-07-26 11:31:08 +0200]:
Matthias Kirschner wrote:
Do you also want to have access to the pdfreaders.org website? Then you could help to improve things there when we get requests.
I think I'll pass (for now) and stay in the fringes. I'm busy enough as it is. :)
Ok. I understand.
Regards, Matthias
Matthias Kirschner wrote:
I just approved your request. Can you write a firm introduction for the rest here?
My name is Sam Geeraerts. I live in Belgium. I've been using free software for over 10 years. I'm one of the maintainers of gNewSense. I've been looking with interest at FSFE for some years, but I only recently became a member, mainly because I never got around to actually doing it. The two most important FSFE activities for me personally are currently the Dutch branch and the PDFReaders campaign. It's one of the best free software campaign ideas I've seen so far.
Hi Sam,
* Sam Geeraerts samgee@fsfe.org [2011-07-25 20:29:36 +0200]:
Matthias Kirschner wrote:
That's true. Ole, is it OK if we brainstorm about a cool name? Things like GATA: "Government Ads Terminator Assistant" come to my mind ;)
Yay, brainstorming! Let me go crazy for a bit:
- PRONTO: PDF Readers Ought Not To Offend
- PReSTo: PDF Reader Search Tool
- PROSIT: PDF Reader On Site Investigation Tool
- PROBE: PdfReaders.Org {Baring,Batch,Bettering,Bias-finding} Equipment, PdfReaders.Org Bug Evincer
- PDFreeder
Cool, they all sound nice. I think I like PROSIT most :) What do others think? New ideas? Comments on the suggestions?
All the best, Matthias
Le 26/07/2011 10:15, Matthias Kirschner a écrit :
- Sam Geeraerts samgee@fsfe.org [2011-07-25 20:29:36 +0200]:
Yay, brainstorming! Let me go crazy for a bit:
- PRONTO: PDF Readers Ought Not To Offend
- PReSTo: PDF Reader Search Tool
- PROSIT: PDF Reader On Site Investigation Tool
- PROBE: PdfReaders.Org {Baring,Batch,Bettering,Bias-finding} Equipment, PdfReaders.Org Bug Evincer
- PDFreeder
Cool, they all sound nice. I think I like PROSIT most :) What do others think? New ideas? Comments on the suggestions?
PROSIT is very good in my opinion too. I would add an 's' to "Reader", what do you think?
Cheers, Nico
* Nicolas JEAN nicoulas@fsfe.org [2011-07-26 17:19:49 +0200]:
PROSIT is very good in my opinion too. I would add an 's' to "Reader", what do you think?
Yes, that makes sense. Matthias