Thursday, 9 April 2015

Salesforce Site Security: a Site for Customers, NOT for Robots!?

Salesforce Site security

It's quite easy to use a Salesforce Site to serve public pages over the Internet. But what about security?
We won't cover SOQL injection, cross-site scripting (XSS), cross-site request forgery (CSRF), or the other usual website security risks: there is plenty of material out there already explaining them.
Instead, we will focus on risks that are specific to Salesforce Sites "by design". We will see the relationship between governor limits and the security of public pages, and how someone could access resources you would not expect.

Security and Governor Limits
Many platform limitations (governor limits) are defined to prevent excessive consumption of platform resources.
The flip side is that if you reach the limit, your site goes down: "If, in a given calendar month, your organization reaches 300% of its page view limit, your sites are disabled until the next calendar month begins or you purchase more page views."
The drawback is that someone can take your Salesforce Site down simply by pushing it past those governor limits.

The limits

Governor limits are defined according to the type of org (Developer, Enterprise, Unlimited) and are based on network and CPU consumption.
Edition    | Max. No. of Sites | Bandwidth Limit (per rolling 24-hour period per site) | Service Request Time (per rolling 24-hour period per site) | Maximum Page Views
Developer  | 1                 | 500 MB                                                | 10 minutes                                                 | N/A
Enterprise | 25                | 1 GB (Sandbox), 40 GB (Production)                    | 30 minutes (Sandbox), 60 hours (Production)                | 500,000
Unlimited  | 25                | 1 GB (Sandbox), 40 GB (Production)                    | 30 minutes (Sandbox), 60 hours (Production)                | 1,000,000

You have created a pretty Salesforce Site to run your business, and you want to keep it cost-effective.
The platform imposes limits on CPU time, bandwidth, and pages viewed per day (as described in the table above); if you reach them, you will probably have to pay to raise them.
Resource consumption should therefore be useful, that is, targeted at your expected visitors. Did you know that most visits are not real human visits?
The web is not a beautiful place populated only by friendly people. A large part of its traffic comes from machines. These "bots" (robots) download pages for good or bad reasons, and every page they fetch from your site consumes part of your available resources. The issue is the ratio between humans and bots: if a website has no protection in place, it will often receive more traffic from bots than from humans.

How to optimize resources?
The first step is to prevent crawling by bad bots. Of course, you will have to draw the line between "good" and "bad".
For instance, Google, Bing, and a few others crawl the web so that you appear in search results. Don't block them: they bring you real visitors.
On the other side, some bots crawl the web to harvest content that will be resold: they consume your resources and you get nothing in return. Stop them.
Some bots will even harvest email addresses from your pages, or probe for security issues (such as a form that is not protected by a CAPTCHA). You absolutely need to block them.
The quick win is that Salesforce provides a simple, standard way to tell bots they are not welcome: a file called "robots.txt" (the file is shared by all the Salesforce Sites in your org). You just have to define a list of user agents and the rights associated with each.
By default, Salesforce blocks all bots for non-production orgs (Developer Edition, sandboxes, etc.). You absolutely need to define a robots.txt for your production org. The syntax is quite simple, but the content is not easy to define: how do you know which robots to put in the file?
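As a quick refresher on the syntax: a robots.txt file is made of groups, each consisting of one or more User-agent lines followed by the rules that apply to them. A minimal sketch (the bot name "BadBot" is purely illustrative):

```
# Block one specific crawler from the whole site
User-agent: BadBot
Disallow: /

# Everyone else may crawl, but stay out of /private/
User-agent: *
Disallow: /private/
```

Note that robots.txt is purely advisory: well-behaved crawlers honor it, but nothing forces a bot to comply.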
The following content is a Visualforce page to add to your org; then point your Salesforce Site configuration at this page as its robots.txt. Taking five minutes to do this can save you a lot of money.

<apex:page contentType="text/plain" showHeader="false">
User-agent: 008
user-agent: AhrefsBot
User-agent: aipbot
User-agent: Alexibot
User-agent: AlvinetSpider
User-agent: Amfibibot
User-agent: Antenne Hatena
User-agent: antibot
User-agent: ApocalXExplorerBot
User-agent: asterias
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: Biglotron
User-agent: BizInformation
User-agent: Black Hole
User-agent: BlowFish/1.0
User-agent: BotALot
User-agent: BruinBot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CatchBot
User-agent: ccubee
User-agent: ccubee/3.5
User-agent: Cegbfeieh
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: Combine
User-agent: ConveraCrawler
User-agent: ConveraMultiMediaCrawler
User-agent: CoolBot
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DimensioNet
User-agent: discobot
User-agent: DISCo Pump 3.1
User-agent: DittoSpyder
User-agent: dotbot
User-agent: Drecombot
User-agent: DTAAgent
User-agent: e-SocietyRobot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: envolk
User-agent: EroCrawler
User-agent: EverbeeCrawler
User-agent: ExtractorPro
User-agent: Flamingo_SearchEngine
User-agent: Foobot
User-Agent: FDSE
User-agent: g2Crawler
User-agent: genieBot
User-agent: gsa-crawler
User-agent: Harvest/1.5
User-agent: hloader
User-agent: HooWWWer
User-agent: httplib
User-agent: HTTrack
User-agent: HTTrack 3.0
User-agent: humanlinks
User-agent: Igentia
User-agent: InfoNaviRobot
User-agent: Ipselonbot
User-agent: IRLbot
User-agent: JennyBot
User-agent: JikeSpider
User-agent: Jyxobot
User-agent: KavamRingCrawler
User-agent: Kenjin Spider
User-Agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkScan/8.1a Unix
User-agent: linksmanager
User-agent: LinkWalker
User-Agent: lmspider
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: minibot(NaverRobot)/1.0
User-agent: Mister PiX
User-Agent: MJ12bot
User-agent: MLBot
User-agent: moget
User-agent: moget/2.1
User-agent: MS Search 4.0 Robot
User-agent: MS Search 5.0 Robot
User-Agent: MSIECrawler
User-Agent: MyFamilyBot
User-agent: Naverbot
User-agent: NetAnts
User-agent: NetAttache
User-agent: NetMechanic
User-Agent: NetResearchServer
User-agent: NextGenSearchBot
User-agent: NICErsPRO
User-agent: noxtrumbot
User-agent: NPBot
User-agent: Nutch
User-agent: NutchCVS
User-agent: Offline Explorer
User-Agent: OmniExplorer_Bot
User-agent: Openfind
User-agent: OpenindexSpider
User-Agent: OpenIntelligenceData
User-agent: PhpDig
User-agent: pompos
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: QuepasaCreep
User-agent: QueryN Metasearch
User-agent: Radian6
User-agent: R6_FeedFetcher
User-agent: R6_CommentReader
User-agent: RepoMonkey
User-agent: RMA
User-agent: RufusBot
User-Agent: SBIder
User-Agent: schibstedsokbot
User-Agent: ScSpider
User-agent: SearchmetricsBot
User-Agent: semanticdiscovery
User-agent: SemrushBot
User-agent: Shim-Crawler
User-Agent: ShopWiki
User-agent: SightupBot
User-Agent: silk
user-agent: sistrix
user-agent: sitebot
User-agent: SiteSnagger
User-agent: SiteSucker
User-agent: Slurp
User-agent: Sogou web spider
User-agent: sosospider
User-agent: SpankBot
User-agent: spanner
User-agent: Speedy
User-agent: Sproose
User-agent: Steeler
User-agent: suggybot
User-agent: SuperBot
User-agent: SuperBot/2.6
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Tarantula
User-agent: Teleport
User-agent: Telesoft
User-agent: The Intraformant
User-agent: TheNomad
User-agent: Theophrastus
User-agent: TightTwatBot
User-agent: Titan
User-agent: toCrawl/UrlDispatcher
User-agent: TosCrawler
User-agent: TridentSpider
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: turnitinbot
User-agent: twiceler
User-agent: Ultraseek
User-agent: UrlPouls
User-agent: URLy Warning
User-agent: Vagabondo
User-agent: VCI
User-agent: Verticrawlbot
User-agent: voyager
User-agent: voyager/1.0
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: webcopy
User-agent: WebEnhancer
User-agent: WebIndexer
User-agent: WebmasterWorldForumBot
User-agent: webmirror
User-agent: WebReaper
User-agent: WebSauger
User-agent: website extractor
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebStripper/2.02
User-agent: WebZip
User-agent: Wget
User-agent: WikioFeedBot
User-agent: WinHTTrack
User-agent: WWW-Collector-E
User-agent: Xenu Link Sleuth/1.3.8
User-agent: xirq
User-agent: yacy
User-agent: YRSPider
User-agent: ZeBot
User-agent: ZeBot_www.ze.bz
User-agent: Zeus
User-agent: Zookabot
Disallow: /
User-agent: *
Crawl-delay: 3 
</apex:page>
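Because robots.txt only works against bots that choose to respect it, you may also want a server-side backstop. One option is to inspect the User-Agent header in an Apex controller and skip rendering expensive content for known bad agents. This is a sketch under our own assumptions: the class name, method name, and the small agent list below are illustrative, and User-Agent headers are trivially spoofed, so treat this as a speed bump, not a security boundary.

```apex
// Hypothetical controller helper: server-side User-Agent check.
// Keep the blocked-agent list in sync with your robots.txt.
public with sharing class BotFilterController {

    // Illustrative subset of bad agents (lowercase for comparison).
    private static final Set<String> BLOCKED_AGENTS = new Set<String>{
        'emailcollector', 'httrack', 'webzip', 'wget'
    };

    public Boolean getIsBlockedBot() {
        String ua = ApexPages.currentPage().getHeaders().get('User-Agent');
        if (ua == null) {
            return true; // a request with no User-Agent at all is suspicious
        }
        String lowered = ua.toLowerCase();
        for (String agent : BLOCKED_AGENTS) {
            if (lowered.contains(agent)) {
                return true;
            }
        }
        return false;
    }
}
```

On the page itself, you could then wrap the expensive markup in an `<apex:outputPanel rendered="{!NOT(isBlockedBot)}">` so that matched bots receive a cheap, nearly empty response instead of your full page.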
Courtesy: www.adminbooster.com/. Thank you so much, AdminBooster, for such an informative post.