
Ethical Web Scraping Guidelines
Core Principles
- Respect robots.txt
  - Always check and honor robots.txt directives
  - Cache robots.txt to reduce server load
  - Default to conservative behavior when uncertain
- Proper Identification
  - Use clear, identifiable User-Agent strings
  - Provide contact information
  - Be transparent about your purpose
- Rate Limiting
  - Implement conservative rate limits
  - Use exponential backoff for errors
  - Distribute requests over time
- Data Usage
  - Only collect publicly available business information
  - Respect privacy and data protection laws
  - Provide clear opt-out mechanisms
  - Keep data accurate and up-to-date
- Technical Considerations
  - Cache results to minimize requests
  - Handle errors gracefully
  - Monitor and log access patterns
  - Use structured data when available
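As an illustration of the robots.txt principle above, here is a minimal sketch of a robots.txt check. The function names `parseRobotsTxt` and `isAllowed` are hypothetical, and the parser is deliberately simplified: it handles only `User-agent`, `Allow`, and `Disallow` with plain path prefixes, and ignores wildcard patterns, `Crawl-delay`, and `Sitemap`. The longest matching rule wins, and `Allow` wins a tie.

```javascript
// Hypothetical helper: parse robots.txt text into groups of rules.
// Simplified: plain path prefixes only, no `*` or `$` pattern matching.
function parseRobotsTxt(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty value restricts nothing, so it is skipped.
      if (value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

// Returns true if `path` may be fetched by the crawler named `userAgentToken`.
function isAllowed(groups, userAgentToken, path) {
  const token = userAgentToken.toLowerCase();
  // Prefer groups that name this crawler; otherwise fall back to `*` groups.
  let matching = groups.filter((g) => g.agents.includes(token));
  if (matching.length === 0) {
    matching = groups.filter((g) => g.agents.includes('*'));
  }
  const rules = matching.flatMap((g) => g.rules);
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  // No matching rule means the path is allowed.
  return best === null ? true : best.allow;
}
```

A caller that wants the conservative default from the list above could treat a failed robots.txt fetch as "disallowed" and skip the site until the file can be read.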
Implementation
- Request Headers
const headers = {
  'User-Agent': 'BizSearch/1.0 (+https://bizsearch.com/about)',
  'Accept': 'text/html,application/xhtml+xml',
  'From': 'contact@bizsearch.com'
};
- Rate Limiting
const rateLimits = {
  requestsPerMinute: 10,
  requestsPerHour: 100,
  requestsPerDomain: 20
};
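One way these limits might be enforced is a sliding-window limiter, sketched below. The class name `SlidingWindowLimiter` and its `tryAcquire` method are hypothetical. The sketch covers only the per-minute and per-hour limits; `requestsPerDomain` is left out because the configuration does not say which time window it applies to.

```javascript
// Limits copied from the configuration above.
const limits = { requestsPerMinute: 10, requestsPerHour: 100 };

// Hypothetical sliding-window limiter: remembers the timestamps of recent
// requests and refuses a new one if either window is already full.
class SlidingWindowLimiter {
  constructor({ requestsPerMinute, requestsPerHour }) {
    this.perMinute = requestsPerMinute;
    this.perHour = requestsPerHour;
    this.timestamps = []; // request times in ms, oldest first
  }

  // Returns true and records the request if it fits within both windows.
  tryAcquire(now = Date.now()) {
    const hourAgo = now - 60 * 60 * 1000;
    const minuteAgo = now - 60 * 1000;
    // Forget requests that have left the one-hour window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= hourAgo) {
      this.timestamps.shift();
    }
    const inLastMinute = this.timestamps.filter((t) => t > minuteAgo).length;
    if (this.timestamps.length >= this.perHour || inLastMinute >= this.perMinute) {
      return false;
    }
    this.timestamps.push(now);
    return true;
  }
}

const limiter = new SlidingWindowLimiter(limits);
if (limiter.tryAcquire()) {
  // Safe to send the next request here.
}
```

Passing `now` explicitly keeps the limiter easy to test; in normal use the default `Date.now()` is enough.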
- Caching
const cacheSettings = {
  ttl: 24 * 60 * 60, // 24 hours, in seconds
  maxSize: 1000 // entries
};
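A minimal sketch of a cache that honors these two settings follows. The class name `TtlCache` is hypothetical, `ttl` is assumed to be in seconds, and evicting the oldest-inserted entry when the cache is full is an assumption: the settings give a maximum size but do not name an eviction policy.

```javascript
// Settings copied from the configuration above (ttl assumed to be in seconds).
const settings = { ttl: 24 * 60 * 60, maxSize: 1000 };

// Hypothetical in-memory cache: entries expire after `ttl` seconds, and the
// oldest-inserted entry is evicted once `maxSize` is reached.
class TtlCache {
  constructor({ ttl, maxSize }) {
    this.ttlMs = ttl * 1000;
    this.maxSize = maxSize;
    this.entries = new Map(); // a Map iterates in insertion order
  }

  get(key, now = Date.now()) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, value, now = Date.now()) {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxSize) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

const pageCache = new TtlCache(settings);
pageCache.set('https://example.com/', '<html>...</html>');
```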
Opt-Out Process
- Business owners can opt out by:
  - Submitting a form on our website
  - Emailing opt-out@bizsearch.com
  - Adding a meta tag:
    <meta name="bizsearch" content="noindex">
- We honor opt-outs within:
  - 24 hours for direct requests
  - 72 hours for cached data
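On the crawler side, detecting the opt-out meta tag might look like the sketch below. The function name `hasOptOutMeta` is hypothetical, and the regex-based matching is a simplification; a production crawler would more likely use a real HTML parser.

```javascript
// Hypothetical check for the opt-out meta tag described above.
// Regex-based sketch: accepts either attribute order and ignores letter case.
function hasOptOutMeta(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) => {
    const name = /\bname\s*=\s*["']?([^"'\s>]+)/i.exec(tag);
    const content = /\bcontent\s*=\s*["']?([^"'\s>]+)/i.exec(tag);
    if (!name || !content) return false;
    return (
      name[1].toLowerCase() === 'bizsearch' &&
      content[1].toLowerCase() === 'noindex'
    );
  });
}

const page = '<head><meta name="bizsearch" content="noindex"></head>';
if (hasOptOutMeta(page)) {
  // Skip this page and record the opt-out.
}
```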
Legal Compliance
- Data Protection
  - GDPR compliance for EU businesses
  - CCPA compliance for California businesses
  - Regular data audits and cleanup
- Attribution
  - Clear source attribution
  - Last-updated timestamps
  - Data accuracy disclaimers
Best Practices
- Before Scraping
  - Check robots.txt
  - Verify site status
  - Review terms of service
  - Look for API alternatives
- During Scraping
  - Monitor response codes
  - Respect server hints
  - Implement backoff strategies
  - Log access patterns
- After Scraping
  - Verify data accuracy
  - Update cache entries
  - Clean up old data
  - Monitor opt-out requests
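The "respect server hints" and "implement backoff strategies" items above can be sketched together: use the server's `Retry-After` header when it is present, and fall back to exponential backoff otherwise. The function names, the one-second base delay, and the five-minute cap are all hypothetical choices, and the "full jitter" randomization is one common variant rather than something these guidelines prescribe.

```javascript
// Hypothetical defaults; tune them per site.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 5 * 60 * 1000;

// Exponential backoff: the delay doubles with each failed attempt (0-based),
// capped at MAX_DELAY_MS. Full jitter then picks a random point in [0, delay)
// so that many clients do not retry in lockstep.
function backoffDelayMs(attempt, random = Math.random) {
  const exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * exponential);
}

// Parses a Retry-After header value, which is either a number of seconds or
// an HTTP date. Returns a delay in milliseconds, or null if it cannot be parsed.
function retryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null) return null;
  const trimmed = String(headerValue).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Prefer the server's own hint; otherwise fall back to backoff.
function nextDelayMs(attempt, retryAfterHeader, now = Date.now(), random = Math.random) {
  const hinted = retryAfterMs(retryAfterHeader, now);
  return hinted !== null ? hinted : backoffDelayMs(attempt, random);
}
```

A typical call site would pass the failed response's `Retry-After` header, for example `nextDelayMs(attempt, response.headers.get('retry-after'))`, and wait that long before retrying.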
Contact
For questions or concerns about our scraping practices:
- Email: ethics@bizsearch.com
- Phone: (555) 123-4567
- Web: https://bizsearch.com/ethics