Initial commit
Jeremy D

Jeremy D committed on 2022-03-06 14:24:07
Showing 4 changed files, with 280 additions and 0 deletions.

... ...
@@ -0,0 +1,163 @@
1
+<?php
2
+// Manual Install?
3
+// If the user is installing manually, they can upload this file (AddMoreSpiders.php)
4
+// to the same directory as SMF and run it directly; loading SSI.php lets the install proceed.
5
+if (!defined('SMF'))
6
+	include_once('SSI.php');
7
+
8
+// Globals
9
+global $smcFunc, $db_prefix, $context;
10
+
11
+// If we're uninstalling, we can get out of here.
12
+if (!empty($context['uninstalling']))
13
+	return;
14
+
15
+$addSpider = array(
16
+	// MAJOR spiders							 			Website												Description
17
+	array('Ask', 'Teoma'),									// https://www.ask.com								Spider for Ask Search Engine
18
+	array('Baidu', 'Baiduspider'),							// https://www.baidu.com							Spider for Chinese search engine
19
+	array('Bing', 'bingbot'),								// https://www.bing.com								Spider for Microsoft Bing
20
+	array('GigaBot', 'Gigabot'),							// https://www.gigablast.com						Another heavily travelled spider
21
+	array('DuckDuckGo', 'DuckDuckBot'),						// https://www.duckduckgo.com						Spider for the DuckDuckGo search engine
22
+	array('Google-AdSense', 'Mediapartners-Google'),		// https://www.google.com							Spider related to Adsense/Adwords
23
+	array('Google-Adwords', 'AdsBot-Google'),				// https://www.google.com							Spider related to Adwords
24
+	array('Google-SA', 'gsa-crawler'),						// https://www.google.com							Google Search Appliance Spider
25
+	array('Google-Image', 'Googlebot-Image'),				// https://www.google.com							Spider for google image search
26
+	array('InternetArchive', 'ia_archiver-web.archive.org'),// https://www.archive.org							Wayback Machine spider
27
+	array('Alexa', 'ia_archiver'),							// https://www.alexa.com							*Must be detected after Internet Archive
28
+	array('Omgili', 'omgilibot'),							// https://www.omgili.com							Extremely aggressive Messageboard/forum Spider
29
+	array('Speedy Spider', 'Speedy Spider'),				// https://www.entireweb.com						Entire web spider
30
+	array('Yahoo', 'yahoo'),								// https://www.yahoo.com							For Yahoo Publisher Network  (a variety in use)
31
+	array('Yahoo JP', 'Y!J'),								// https://www.yahoo.co.jp							Spider for Yahoo Japan
32
+	array('Facebook', 'facebot'),							// http://www.facebook.com/externalhit_uatext.php	Spider for Facebook from external link crawling
33
+	array('Facebook External hit', 'facebookexternalhit'),	// http://www.facebook.com/externalhit_uatext.php	Spider for Facebook from external link crawling
34
+	
35
+	// Checkers/Testers/Robots						 		Website												Description
36
+	array('DeadLinksChecker', 'link validator'),			// https://www.dead-links.com/						Checks your site for dead/bad links
37
+	array('W3C Validator', 'W3C_Validator'),				// https://validator.w3.org							Checks standards validity of any html/xhtml page
38
+	array('W3C CSSValidator', 'W3C_CSS_Validator'),			// https://jigsaw.w3.org/css-validator/				Checks standards validity of css stylesheets
39
+	array('W3C FeedValidator', 'FeedValidator'),			// https://validator.w3.org/feed/ 					Checks standards validity of atom/rss feeds
40
+	array('W3C LinkChecker', 'W3C-checklink'),				// https://validator.w3.org/checklink				Checks links on any html/xhtml page are valid
41
+	array('W3C mobileOK', 'W3C-mobileOK'),					// https://www.w3.org/2006/07/mobileok-ddc			Checks page for how good it is for mobiles
42
+	array('W3C P3PValidator', 'P3P Validator'),				// https://www.w3.org/P3P/validator.html			Checks P3P privacy policies
43
+			
44
+	// Feed readers								 			Website												Description
45
+	array('Bloglines', 'Bloglines'),						// https://www.bloglines.com						Spider for blog/rich web content (owned by Ask)
46
+	array('Feedburner', 'Feedburner'),						// https://www.feedburner.com						Another RSS feed reader
47
+	
48
+	// Website Thumbnail/Snapshot/Thumbshot takers		 	Website												Description
49
+	array('SnapBot', 'Snapbot'),							// https://www.snap.com								Snapshots provider
50
+	array('Picsearch', 'psbot'),							// https://www.picsearch.com						Picture/Image Search Engine
51
+	array('Websnapr', 'Websnapr'),							// https://www.websnapr.com							Snapshot/site screenshot taker
52
+			
53
+	// More MINOR Spiders/Robots					 		Website												Description
54
+	array('AllTheWeb', 'FAST-WebCrawler'), 					// https://www.alltheweb.com						Spider for alltheweb (now owned by Yahoo)
55
+	array('Altavista', 'Scooter'),							// https://www.altavista.com						Another Major Search Engine spider
56
+	array('Asterias', 'asterias'),							// https://www.aol.com								Media Spider
57
+	array('AOL', 'AOLBuild'),								// https://www.aol.com								Media Spider
58
+	array('192bot', '192.comAgent'),						// https://www.192.com								Spider to index for 192.com
59
+	array('AbachoBot', 'ABACHOBot'),						// https://www.abacho.com							Spider for multi language search engine/translator
60
+	array('Abdcatos', 'abcdatos'),							// https://www.abcdatos.com/botlink/				Spider for Italian Search Engine
61
+	array('Acoon', 'Acoon'),								// https://www.acoon.de								Spider for small search engine
62
+	array('Accoona', 'Accoona'),							// https://www.accoona.com							Spider for Accoona
63
+	array('BecomeBot', 'BecomeBot'),						// https://www.become.com							Shopping/Products type search engine
64
+	array('BlogRefsBot', 'BlogRefsBot'),					// https://www.blogrefs.com/about/bloggers			Blogs related spider
65
+	array('Daumoa', 'Daumoa'),								// https://ws.daum.net/aboutkr.html					South Korean Search Engine Spider
66
+	array('DuckDuckBot', 'DuckDuckBot'),					// https://duckduckgo.com/duckduckbot.html			Spider for small search engine
67
+	array('Exabot', 'Exabot'),								// https://www.exalead.com							Spider for small search engine
68
+	array('Furl', 'Furlbot'),								// https://www.furl.net								Spider for Furl social bookmarking site
69
+	array('FyperSpider', 'FyberSpider'),					// https://www.fybersearch.com						Spider for Small Search Engine
70
+	array('Geona', 'GeonaBot'),								// https://www.geona.com							Spider for another small search engine
71
+	array('GirafaBot', 'Girafabot'),						// https://www.girafa.com/							Thumbshot provider
72
+	array('GoSeeBot', 'GoSeeBot'),							// https://www.gosee.com/bot.html					Spider for small search engine
73
+	array('Ichiro', 'ichiro'),								// https://help.goo.ne.jp/door/crawler.html			Spider for Japanese search engine
74
+	array('LapozzBot', 'LapozzBot'),						// https://www.lapozz.hu				 			Spider for Hungarian search engine
75
+	array('Looksmart', 'WISENutbot'),						// https://www.looksmart.com						Spider related to advertising
76
+	array('Lycos', 'Lycos_Spider'),							// https://www.lycos.com							Spider for search engine
77
+	array('Majestic12', 'MJ12bot/v2'),						// https://www.majestic12.co.uk/					Distributed Search Engine Project
78
+	array('MLBot', 'MLBot'),								// https://www.metadatalabs.com/					Media indexing spider
79
+	array('MSRBOT', 'msrbot'),								// https://research.microsoft.com/research/sv/mrbot/  	Microsoft Research bot
80
+	array('MSR-ISRCCrawler', 'MSR-ISRCCrawler'),			// https://www.microsoft.com/research/	  			Another Microsoft Research bot
81
+	array('Naver', 'NaverBot'),								// https://www.naver.com							South Korean Search Engine Spider
82
+	array('Naver', 'Yeti'),									// https://www.naver.com							Another NaverBot for the South Korean Search Engine
83
+	array('NoxTrumBot', 'noxtrumbot'),						// https://www.noxtrum.com							Spider for Spanish search engine
84
+	array('OmniExplorer', 'OmniExplorer_Bot'),				// https://www.omni-explorer.com/					Spider
85
+	array('OnetSzukaj', 'OnetSzukaj'),						// https://szukaj.onet.pl							Polish Search Engine Spider
86
+	array('ScrubTheWeb', 'Scrubby'),						// https://www.scrubtheweb.com						Spider for Scrub the web
87
+	array('SearchSight', 'SearchSight'),					// https://www.searchsite.com						Another search engine
88
+	array('Seeqpod', 'Seeqpod'),							// https://www.seeqpod.com							Spider for search engine (the google for mp3 files)
89
+	array('Shablast', 'ShablastBot'),						// https://www.shablast.com							Spider for a small search engine
90
+	array('SitiDiBot', 'SitiDiBot'),						// https://www.sitidi.net							Spider for italian Sitidi search engine
91
+	array('Slider', 'silk/1.0'),							// https://www.slider.com							Spider for Slider, but it only spiders DMOZ entries
92
+	array('Sogou', 'Sogou'),								// https://www.sogou.com							Spider for Chinese search engine
93
+	array('Sosospider', 'Sosospider'),						// https://help.soso.com/webspider.htm				Non-english search engine
94
+	array('StackRambler', 'StackRambler'),					// https://www.rambler.ru/doc/robots.shtml			Spider for Russian portal/search engine
95
+	array('SurveyBot', 'SurveyBot'),						// https://www.domaintools.com						Probe for website statistics (WhoIs  Source)
96
+	array('Touche', 'Touche'),								// https://www.touche.com.ve						Another small search engine
97
+	array('Walhello', 'appie'),								// https://www.wahello.com/							Spider for wahello
98
+	array('WebAlta', 'WebAlta'), 							// https://www.webalta.net							Russian Search Engine
99
+	array('Wisponbot', 'wisponbot'), 						// https://www.wispon.com							Korean Search Engine
100
+	array('YacyBot', 'yacybot'),							// https://www.yacy.com			 					Crawler for distributed search engine
101
+	array('YodaoBot', 'YodaoBot'),							// https://www.yodao.com							Spider for Chinese Search Engine
102
+			
103
+	// Google-Wanna-Be's - Spiders/Robots for Startups		 Website											Description	
104
+	array('Charlotte', 'Charlotte'),						// https://www.searchme.com/support/ 				Spider for new search engine (in beta)
105
+	array('DiscoBot', 'DiscoBot'),							// https://discoveryengine.com/discobot.html		Spider for new search engine startup
106
+	array('EnaBot', 'EnaBot'),								// https://www.enaball.com/crawler.html				Experimental new spider
107
+	array('Gaisbot', 'Gaisbot'),							// https://gais.cs.ccu.edu.tw/robot.php				Spider for search engine startup
108
+	array('Kalooga', 'kalooga'),							// https://www.kalooga.com							Spider for new media search engine (in beta)
109
+	array('ScoutJet', 'ScoutJet'),							// https://www.scoutjet.com/						Spider for new search engine (by the DMOZ founders)
110
+	array('TinEye', 'TinEye'),								// https://tineye.com/crawler.html					Spider for search engine startup
111
+	array('Twiceler', 'twiceler'),							// https://www.cuill.com/twiceler/robot.html		Experimental Spider, (aggressive)
112
+	
113
+	// Software								 				Website												Description
114
+	array('GSiteCrawler', 'GSiteCrawler'),					// https://www.gsitecrawler.com/					Windows Based Sitemap Generator Software
115
+	array('HTTrack', 'HTTrack'),							// https://www.httrack.com							HTTrack Website Copier - Offline Browser
116
+	array('Wget', 'Wget'),									// https://www.gnu.org/software/wget/				GNU software to retrieve files
117
+	// Reason for detecting these: they can be very resource-intensive, so seeing them in use lets you block them if necessary.
118
+);
119
+
120
+// Correction from v1.0
121
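+// Alexa and InternetArchive use similar user agents, so remove the InternetArchive row that was stored with Alexa's 'ia_archiver' string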
122
+$smcFunc['db_query']('', '
123
+	DELETE FROM {db_prefix}spiders
124
+	WHERE spider_name = {string:spider_name}
125
+		AND user_agent = {string:user_agent}
126
+	',
127
+	array(
128
+		'spider_name' => 'InternetArchive',
129
+		'user_agent' => 'ia_archiver',
130
+	)
131
+);
132
+
133
+// Grab all the existing spiders to match against
134
+$request = $smcFunc['db_query']('', '
135
+	SELECT user_agent
136
+	FROM {db_prefix}spiders',
137
+	array()
138
+);
139
+
140
+$knownspiders = array();
141
+if ($smcFunc['db_num_rows']($request) != 0)
142
+	while ($row = $smcFunc['db_fetch_assoc']($request))
143
+		$knownspiders[] = $row['user_agent'];
144
+$smcFunc['db_free_result']($request);
145
+
146
+// Now go through each spider in the list above
147
+foreach ($addSpider as $spider)
148
+	// If it doesn't already exist in the table, then add it
149
+	if (!in_array($spider[1], $knownspiders))
150
+		// Now add each spider
151
+		$smcFunc['db_insert']('ignore',
152
+			'{db_prefix}spiders',
153
+			array('spider_name' => 'string', 'user_agent' => 'string', 'ip_info' => 'string'),
154
+			array($spider[0], $spider[1], ''),
155
+			array('spider_name', 'user_agent', 'ip_info')
156
+		);
157
+
158
+//Unset everything
159
+unset($knownspiders, $addSpider, $spider);
160
+
161
+// If we're using SSI, tell them we're done
162
+if (SMF == 'SSI')
163
+	echo 'Database changes are complete!';
0 164
\ No newline at end of file
... ...
@@ -0,0 +1,28 @@
1
+<?xml version="1.0"?>
2
+<!DOCTYPE package-info SYSTEM "http://www.simplemachines.org/xml/package-info">
3
+<package-info xmlns="http://www.simplemachines.org/xml/package-info" xmlns:smf="http://www.simplemachines.org/">
4
+	<name>More Spiders</name>
5
+	<id>karlbenson:morespiders</id>
6
+	<type>modification</type>
7
+	<version>1.3</version>
8
+	
9
+	<!--// Upgrade  (from earlier version of the mod) //-->
10
+	<upgrade for="1.0 - 1.2">
11
+		<readme type="file" parsebbc="true">readme.txt</readme>
12
+		<code>AddMoreSpiders.php</code>
13
+	</upgrade>
14
+
15
+	<!--// Install  //-->
16
+	<install>
17
+		<readme type="file" parsebbc="true">readme.txt</readme>
18
+		<code>AddMoreSpiders.php</code>
19
+	</install>
20
+
21
+	<!--// Uninstall  (for what it's worth) //-->
22
+	<uninstall>
23
+		<readme type="inline" parsebbc="true">Uninstalling does not remove the Spiders from the database. This is so you don't lose any of your tracking/stats data.  You can remove the spiders via the Spiders tab of the Search Engine tracking section.
24
+		</readme>
25
+		<code>AddMoreSpiders.php</code>
26
+	</uninstall>
27
+
28
+</package-info>
0 29
\ No newline at end of file
... ...
@@ -0,0 +1,42 @@
1
+# MORE SPIDERS
2
+[By Karl Benson](https://custom.simplemachines.org/mods/index.php?action=profile;u=63186) | [Link to Customization](https://custom.simplemachines.org/mods/index.php?mod=1157) | [Support Topic](https://www.simplemachines.org/community/index.php?topic=233636)
3
+
4
+#### Introduction
5
+Adds 88 more spiders/crawlers/bots to your Spiders section in SMF
6
+
7
+#### SMF Support
8
+| Version | Supported |
9
+| ------ | ------ |
10
+| 2.1.x | Yes |
11
+| 2.0.x | Yes |
12
+| 1.x.x | No |
13
+
14
+#### Features
15
+- 88 spiders belonging to search engines, validators, checkers, crawlers, bots, software, etc.
16
+- Including: Facebook, Ask, Baidu, GigaBot, Google-AdSense, Google-Adwords, Google-SA, Google-Image, Bing, InternetArchive, Alexa, Omgili, Speedy Spider, Yahoo, Yahoo JP, DeadLinksChecker, W3C Validator, W3C CSSValidator, W3C FeedValidator, W3C LinkChecker, W3C mobileOK, W3C P3PValidator, Bloglines, Feedburner, SnapBot, Picsearch, Websnapr, AllTheWeb, Altavista, Asterias, 192bot, AbachoBot, Abdcatos, Acoon, Accoona, BecomeBot, BlogRefsBot, Daumoa, DuckDuckBot, Exabot, Furl, FyperSpider, Geona, GirafaBot, GoSeeBot, Ichiro, LapozzBot, Looksmart, Lycos, Majestic12, MLBot, MSRBOT, MSR-ISRCCrawler, Naver, NoxTrumBot, OmniExplorer, OnetSzukaj, ScrubTheWeb, SearchSight, Seeqpod, Shablast, SitiDiBot, Slider, Sogou, Sosospider, StackRambler, SurveyBot, Touche, Walhello, WebAlta, Wisponbot, YacyBot, YodaoBot, Charlotte, DiscoBot, EnaBot, Gaisbot, Kalooga, ScoutJet, TinEye, Twiceler, GSiteCrawler, HTTrack, Wget
17
+*(Remember SMF detects most Google/Yahoo/MSN bots by default)*
18
+
19
+#### Spider List
20
+Most sites offering spider/bot lists include tonnes of inactive ones. I'm putting together my own list by detecting spiders in the wild on my own sites, plus any that get reported to me (after I've checked them out). If there are any ACTIVE ones I'm missing, please let me know in the support topic.
21
+
22
+#### Installation
23
+**Previous versions of the mod do NOT need to be uninstalled.**
24
+This mod only adds a row to the database for each spider.
25
+ - No theme edits required.
26
+ - No language strings to translate or edit.
27
+ - It adds database rows only.
28
+ - It skips spiders that already exist in the table.
29
+
30
+Install and you're done.
31
+*Note: Uninstalling removes the mod from your mod list, but it does not remove the spiders from your database. To remove them, delete each one in your SMF Search Engines/Spider section.*
32
+
33
+#### Manual Installation
34
+For manual installation, just upload AddMoreSpiders.php to your SMF directory and run it in your browser (then delete the file).
35
+
36
+#### Useful Links
37
+ - [Manual Installation Of Mods](https://wiki.simplemachines.org/smf/Manual_installation_of_mods)
38
+ - [How Do I Modify Files?](https://www.simplemachines.org/community/index.php?topic=24110.0)
39
+
40
+#### Support
41
+Please use the modification thread for support with this modification.
42
+*(Please don't ask me to do the edits for you)*
0 43
\ No newline at end of file
... ...
@@ -0,0 +1,47 @@
1
+[size=4]MORE SPIDERS[/size]
2
+[url=https://custom.simplemachines.org/mods/index.php?action=profile;u=63186]By Karl Benson[/url] | [url=https://custom.simplemachines.org/mods/index.php?mod=1157]Link to Customization[/url] | [url=https://www.simplemachines.org/community/index.php?topic=233636]Support Topic[/url]
3
+
4
+[size=3]Introduction[/size]
5
+Adds 88 more spiders/crawlers/bots to your Spiders section in SMF
6
+
7
+[size=3]SMF Support[/size]
8
+[table]
9
+[tr][td]Version[/td][td]Supported[/td][/tr]
10
+[tr][td]2.1.x[/td][td]Yes[/td][/tr]
11
+[tr][td]2.0.x[/td][td]Yes[/td][/tr]
12
+[tr][td]1.x.x[/td][td]No[/td][/tr]
13
+[/table]
14
+
15
+[size=3]Features[/size]
16
+[list]
17
+[li]88 spiders belonging to search engines, validators, checkers, crawlers, bots, software, etc.[/li]
18
+[li]Including: Facebook, Ask, Baidu, GigaBot, Google-AdSense, Google-Adwords, Google-SA, Google-Image, Bing, InternetArchive, Alexa, Omgili, Speedy Spider, Yahoo, Yahoo JP, DeadLinksChecker, W3C Validator, W3C CSSValidator, W3C FeedValidator, W3C LinkChecker, W3C mobileOK, W3C P3PValidator, Bloglines, Feedburner, SnapBot, Picsearch, Websnapr, AllTheWeb, Altavista, Asterias, 192bot, AbachoBot, Abdcatos, Acoon, Accoona, BecomeBot, BlogRefsBot, Daumoa, DuckDuckBot, Exabot, Furl, FyperSpider, Geona, GirafaBot, GoSeeBot, Ichiro, LapozzBot, Looksmart, Lycos, Majestic12, MLBot, MSRBOT, MSR-ISRCCrawler, Naver, NoxTrumBot, OmniExplorer, OnetSzukaj, ScrubTheWeb, SearchSight, Seeqpod, Shablast, SitiDiBot, Slider, Sogou, Sosospider, StackRambler, SurveyBot, Touche, Walhello, WebAlta, Wisponbot, YacyBot, YodaoBot, Charlotte, DiscoBot, EnaBot, Gaisbot, Kalooga, ScoutJet, TinEye, Twiceler, GSiteCrawler, HTTrack, Wget
19
+[i](Remember SMF detects most Google/Yahoo/MSN bots by default)[/i][/li]
20
+[/list]
21
+
22
+[size=3]Spider List[/size]
23
+Most sites offering spider/bot lists include tonnes of inactive ones. I'm putting together my own list by detecting spiders in the wild on my own sites, plus any that get reported to me (after I've checked them out). If there are any ACTIVE ones I'm missing, please let me know in the support topic.
24
+
25
+[size=3]Installation[/size]
26
+[b]Previous versions of the mod do NOT need to be uninstalled.[/b]
27
+This mod only adds a row to the database for each spider.
28
+[*]No theme edits required.
29
+[*]No language strings to translate or edit.
30
+[*]It adds database rows only.
31
+[*]It skips spiders that already exist in the table.
32
+
33
+Install and you're done.
34
+[i]Note: Uninstalling removes the mod from your mod list, but it does not remove the spiders from your database. To remove them, delete each one in your SMF Search Engines/Spider section.[/i]
35
+
36
+[size=3]Manual Installation[/size]
37
+For manual installation, just upload AddMoreSpiders.php to your SMF directory and run it in your browser (then delete the file).
38
+
39
+[size=3]Useful Links[/size]
40
+[list]
41
+[li][url=https://wiki.simplemachines.org/smf/Manual_installation_of_mods]Manual Installation Of Mods[/url][/li]
42
+[li][url=https://www.simplemachines.org/community/index.php?topic=24110.0]How Do I Modify Files?[/url][/li]
43
+[/list]
44
+
45
+[size=3]Support[/size]
46
+Please use the modification thread for support with this modification.
47
+[i](Please don't ask me to do the edits for you)[/i]
0 48
\ No newline at end of file