Salesforce Sites are for customers, NOT for bots!
Salesforce Site security
It's quite easy to use a Salesforce site to present public pages over the
Internet. But what about security?
We won't cover SOQL injection, Cross-Site Scripting (XSS), Cross-Site Request
Forgery (CSRF) or any other usual website security risk. There is plenty of
information available out there to understand them!
We intend to focus on risks that are specific to Salesforce Sites and exist
"by design". We will see the relationship between governor limits and the
security of public pages, and how someone could access items that you would
not expect.
Security and Governor Limits
There are many platform
limitations (governor limits) defined to prevent excessive consumption of
platform resources.
On the other hand, if you reach the limit, your site will go down: "If, in a
given calendar month, your organization reaches 300% of its page view limit,
your sites are disabled until the next calendar month begins or you purchase
more page views."
The drawback is that someone can take your Salesforce site down just by pushing it over the governor limits.
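To put a number on the quoted rule, here is a minimal Python sketch. It assumes the Enterprise figure of 500,000 page views from the limits table in this post and a 30-day month. The function names are made up for this illustration and are not part of any Salesforce API.

```python
# Illustrative arithmetic only: when would the quoted 300% rule disable a site?
DISABLE_FACTOR = 3  # "300% of its page view limit", from the quoted rule


def disable_threshold(monthly_page_view_limit: int) -> int:
    """Page views in one calendar month at which the sites are disabled."""
    return monthly_page_view_limit * DISABLE_FACTOR


def requests_per_minute_to_disable(monthly_page_view_limit: int,
                                   days_in_month: int = 30) -> float:
    """Average request rate that reaches the threshold within one month."""
    minutes_in_month = days_in_month * 24 * 60
    return disable_threshold(monthly_page_view_limit) / minutes_in_month


enterprise_limit = 500_000  # Enterprise figure from the limits table
print(disable_threshold(enterprise_limit))                         # 1500000
print(round(requests_per_minute_to_disable(enterprise_limit), 1))  # 34.7
```

Under these assumptions, a single client sending fewer than 35 requests per minute for a month would be enough to reach the threshold.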
The limits
Governor limits are defined according to the type of org (Developer,
Enterprise, Unlimited) and are based on network and CPU consumption.
| Edition | Max. No. of Sites | Bandwidth Limit (per rolling 24-hour period per site) | Service Request Time (per rolling 24-hour period per site) | Maximum Page Views |
| Developer | 1 | 500 MB | 10 Minutes | N/A |
| Enterprise | 25 | 1 GB (Sandbox), 40 GB (Production) | 30 Minutes (Sandbox), 60 Hours (Production) | 500,000 |
| Unlimited | 25 | 1 GB (Sandbox), 40 GB (Production) | 30 Minutes (Sandbox), 60 Hours (Production) | 1,000,000 |
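As a rough way to read the table, the sketch below estimates how many requests it takes to use up a 24-hour budget. The limits (40 GB and 60 hours) come from the table. The average page size of 200 KB, the average server time of 500 ms per request, and the choice of 1 GB = 1024^3 bytes are assumptions made up for this example.

```python
# Rough estimate: how many requests exhaust a rolling 24-hour budget?
def requests_until_exhausted(limit: int, cost_per_request: int) -> int:
    """Whole number of requests that fit into `limit` at a fixed cost each."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return limit // cost_per_request


# Limits from the table (Enterprise/Unlimited production orgs).
bandwidth_limit_bytes = 40 * 1024 ** 3        # 40 GB, assuming 1 GB = 1024^3 bytes
request_time_limit_ms = 60 * 60 * 60 * 1000   # 60 hours in milliseconds

# Assumed averages, purely illustrative.
avg_page_bytes = 200 * 1024   # 200 KB per page
avg_request_ms = 500          # 0.5 s of server time per request

print(requests_until_exhausted(bandwidth_limit_bytes, avg_page_bytes))  # 209715
print(requests_until_exhausted(request_time_limit_ms, avg_request_ms))  # 432000
```

With these assumed averages, bandwidth runs out first, after roughly 210,000 page loads in a 24-hour window.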
You have created a pretty Salesforce site to manage your business, and you
want to make it cost-effective.
There are some limits on the platform in terms of CPU, bandwidth and page
views (as described in the table above). If you reach those limits, you will
probably have to pay to increase them.
Resource consumption should be useful, meaning targeted at your expected
visitors. Did you know that most visits are not real human visits?
The web is not a beautiful place with only friendly people. The underground
web runs on machines. These "bots" (robots) download pages for good or bad
reasons, and each time they get a page from your site, they consume part of
your available resources. The issue is the ratio between humans and bots. If a
website has not put any protection in place, it will get more traffic from
bots than from humans.
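One way to get a feel for this ratio on your own site is to look at the User-Agent strings in your traffic logs. The sketch below is a simple heuristic, not a reliable detector: the marker words and the sample strings are assumptions chosen for illustration, and bots that disguise themselves as browsers will not be counted.

```python
# Heuristic sketch: estimate the share of requests that come from bots.
BOT_MARKERS = ("bot", "crawler", "spider")  # assumed marker words


def is_bot(user_agent: str, markers=BOT_MARKERS) -> bool:
    """True if the User-Agent string contains one of the marker words."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in markers)


def bot_share(user_agents) -> float:
    """Fraction of requests (0.0 to 1.0) whose User-Agent looks like a bot."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    return sum(is_bot(agent) for agent in agents) / len(agents)


# Hypothetical sample of User-Agent values taken from a traffic log.
sample = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (compatible; AhrefsBot/7.0)",
    "MJ12bot/v1.4.8",
    "Mozilla/5.0 (Macintosh)",
]
print(bot_share(sample))  # 0.5
```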
How to optimize resources? 
The first step is to prevent crawling by bad bots. Of course, you will have
to decide which bots are "good" and which are "bad".
For instance, Google, Bing and a few others crawl the web so that you appear
in search results. Don't block them, as they bring you real visitors.
On the other hand, a few bots crawl the web to collect content that will be
sold: they consume your resources and you don't get any money from them. Stop
them.
You can even have bots that harvest email addresses from your pages, or try
to identify security issues (such as a form that is not protected by a
CAPTCHA). You absolutely need to block them.
The quick win is that Salesforce provides a simple standard option to tell
bots they are not welcome: a standard file called "robots.txt" (the file is
common to all your Salesforce sites). You just have to define a list of bots
and their associated rights. Keep in mind that compliance with robots.txt is
voluntary, so it stops well-behaved crawlers but not bots that choose to
ignore it.
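To illustrate what "a list and associated rights" looks like in practice, here is a small Python sketch that builds a robots.txt body from a list of bot names. The helper `build_robots_txt` is a made-up name for this example; the two bot names are taken from the list published in this post.

```python
# Sketch: generate robots.txt text that blocks named bots entirely
# and asks every other crawler to slow down.
def build_robots_txt(blocked_agents, crawl_delay=None) -> str:
    """Return robots.txt content as a single string."""
    lines = [f"User-agent: {name}" for name in blocked_agents]
    if lines:
        # One Disallow rule applies to every User-agent line above it.
        lines.append("Disallow: /")
    if crawl_delay is not None:
        if lines:
            lines.append("")  # blank line separates the two groups
        lines.append("User-agent: *")
        lines.append(f"Crawl-delay: {crawl_delay}")
    return "\n".join(lines) + "\n"


print(build_robots_txt(["AhrefsBot", "MJ12bot"], crawl_delay=3))
```

Note that `Crawl-delay` is a widely used extension rather than part of the core robots.txt standard, so not every crawler honors it.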
By default, Salesforce blocks all bots on non-production orgs (Developer
Edition, etc.). You absolutely need to define a robots.txt for your production
org. The syntax is quite simple, but the content is not easy to define: how
can you know which robots to put in the file?
The following content is a Visualforce page that you will have to add to your
org; then point to this Visualforce page in your Salesforce site
configuration. Taking 5 minutes to do this can save a lot of money:
<apex:page contentType="text/plain" showHeader="false">
User-agent: 008
user-agent: AhrefsBot
User-agent: aipbot
User-agent: Alexibot
User-agent: AlvinetSpider
User-agent: Amfibibot
User-agent: Antenne Hatena
User-agent: antibot
User-agent: ApocalXExplorerBot
User-agent: asterias
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: Biglotron
User-agent: BizInformation
User-agent: Black Hole
User-agent: BlowFish/1.0
User-agent: BotALot
User-agent: BruinBot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CatchBot
User-agent: ccubee
User-agent: ccubee/3.5
User-agent: Cegbfeieh
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: Combine
User-agent: ConveraCrawler
User-agent: ConveraMultiMediaCrawler
User-agent: CoolBot
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DimensioNet
User-agent: discobot
User-agent: DISCo Pump 3.1
User-agent: DittoSpyder
User-agent: dotbot
User-agent: Drecombot
User-agent: DTAAgent
User-agent: e-SocietyRobot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: envolk
User-agent: EroCrawler
User-agent: EverbeeCrawler
User-agent: ExtractorPro
User-agent: Flamingo_SearchEngine
User-agent: Foobot
User-Agent: FDSE
User-agent: g2Crawler
User-agent: genieBot
User-agent: gsa-crawler
User-agent: Harvest/1.5
User-agent: hloader
User-agent: HooWWWer
User-agent: httplib
User-agent: HTTrack
User-agent: HTTrack 3.0
User-agent: humanlinks
User-agent: Igentia
User-agent: InfoNaviRobot
User-agent: Ipselonbot
User-agent: IRLbot
User-agent: JennyBot
User-agent: JikeSpider
User-agent: Jyxobot
User-agent: KavamRingCrawler
User-agent: Kenjin Spider
User-Agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkScan/8.1a Unix
User-agent: linksmanager
User-agent: LinkWalker
User-Agent: lmspider
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: minibot(NaverRobot)/1.0
User-agent: Mister PiX
User-Agent: MJ12bot
User-agent: MLBot
User-agent: moget
User-agent: moget/2.1
User-agent: MS Search 4.0 Robot
User-agent: MS Search 5.0 Robot
User-Agent: MSIECrawler
User-Agent: MyFamilyBot
User-agent: Naverbot
User-agent: NetAnts
User-agent: NetAttache
User-agent: NetMechanic
User-Agent: NetResearchServer
User-agent: NextGenSearchBot
User-agent: NICErsPRO
User-agent: noxtrumbot
User-agent: NPBot
User-agent: Nutch
User-agent: NutchCVS
User-agent: Offline Explorer
User-Agent: OmniExplorer_Bot
User-agent: Openfind
User-agent: OpenindexSpider
User-Agent: OpenIntelligenceData
User-agent: PhpDig
User-agent: pompos
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: QuepasaCreep
User-agent: QueryN Metasearch
User-agent: Radian6
User-agent: R6_FeedFetcher
User-agent: R6_CommentReader
User-agent: RepoMonkey
User-agent: RMA
User-agent: RufusBot
User-Agent: SBIder
User-Agent: schibstedsokbot
User-Agent: ScSpider
User-agent: SearchmetricsBot
User-Agent: semanticdiscovery
User-agent: SemrushBot
User-agent: Shim-Crawler
User-Agent: ShopWiki
User-agent: SightupBot
User-Agent: silk
user-agent: sistrix
user-agent: sitebot
User-agent: SiteSnagger
User-agent: SiteSucker
User-agent: Slurp
User-agent: Sogou web spider
User-agent: sosospider
User-agent: SpankBot
User-agent: spanner
User-agent: Speedy
User-agent: Sproose
User-agent: Steeler
User-agent: suggybot
User-agent: SuperBot
User-agent: SuperBot/2.6
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Tarantula
User-agent: Teleport
User-agent: Telesoft
User-agent: The Intraformant
User-agent: TheNomad
User-agent: Theophrastus
User-agent: TightTwatBot
User-agent: Titan
User-agent: toCrawl/UrlDispatcher
User-agent: TosCrawler
User-agent: TridentSpider
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: turnitinbot
User-agent: twiceler
User-agent: Ultraseek
User-agent: UrlPouls
User-agent: URLy Warning
User-agent: Vagabondo
User-agent: VCI
User-agent: Verticrawlbot
User-agent: voyager
User-agent: voyager/1.0
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: webcopy
User-agent: WebEnhancer
User-agent: WebIndexer
User-agent: WebmasterWorldForumBot
User-agent: webmirror
User-agent: WebReaper
User-agent: WebSauger
User-agent: website extractor
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebStripper/2.02
User-agent: WebZip
User-agent: Wget
User-agent: WikioFeedBot
User-agent: WinHTTrack
User-agent: WWW-Collector-E
User-agent: Xenu Link Sleuth/1.3.8
User-agent: xirq
User-agent: yacy
User-agent: YRSPider
User-agent: ZeBot
User-agent: ZeBot_www.ze.bz
User-agent: Zeus
User-agent: Zookabot
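# Every bot listed above is blocked from the whole site
Disallow: /

# All other crawlers are allowed, but asked to wait 3 seconds between requests
User-agent: *
Crawl-delay: 3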
</apex:page>
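Before publishing rules like these, you may want to check how a crawler that follows the standard would interpret them. The sketch below uses Python's standard `urllib.robotparser` module on a shortened copy of the file (three bots from the list above, plus the catch-all group). "Googlebot" is used here only as an example of a crawler that is not on the block list.

```python
# Sketch: check how a compliant crawler would read the rules.
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt content, for illustration.
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: Wget
Disallow: /
User-agent: *
Crawl-delay: 3
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("AhrefsBot", "/"))   # False: listed bot, blocked everywhere
print(rules.can_fetch("Googlebot", "/"))   # True: falls under "User-agent: *"
print(rules.crawl_delay("Googlebot"))      # 3
```

This only shows what a well-behaved crawler would do. A bot that ignores robots.txt is not affected by these rules.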
Courtesy: www.adminbooster.com/ Thank you so much, AdminBooster, for such an informative post.

 